Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents

Hey HN – We are Tarush, Siddhant and Shashij from Cekura (https://www.cekura.ai). We’ve been running voice agent simulations for 1.5 years, and recently extended the same infrastructure to chat. Teams use Cekura to simulate real user conversations, stress-test edge cases in LLM behavior, and catch regressions before they reach production.

The main problem: you can’t manually QA an AI agent. When you ship a new prompt, swap out a model, or add a tool, how do you know the agent still behaves correctly across the thousands of ways users can interact with it? Most teams resort to manual spot-checking (doesn’t scale), waiting for users to complain (too late), or brittle scripted tests.

Our answer is simulation: synthetic users interact with your agent the way real users do, and an LLM judge evaluates whether the agent responded correctly – across the entire conversation arc, not just turn by turn. Three things make this work:

Scenario creation + real conversation import – Our scenario-creation agent bootstraps your test suite from your agent’s description. But real users find paths a generator won’t anticipate, so we also ingest your production conversations and automatically extract test cases from them. Your coverage evolves with your users.
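To make whole-conversation judging concrete, here is a minimal sketch of the idea. The names (`Turn`, `build_judge_prompt`) and the prompt wording are illustrative assumptions, not Cekura’s actual API: the point is that the full transcript is rendered into a single judge call, so failures that span multiple turns stay visible.

```python
# Sketch: render an entire simulated conversation for one LLM-judge call,
# instead of grading each turn in isolation. Illustrative only.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" (the synthetic user) or "agent"
    content: str

def build_judge_prompt(turns, success_criteria):
    """Put the whole transcript in front of the judge, so multi-turn
    failures (e.g. a skipped verification step) can be caught."""
    transcript = "\n".join(f"{t.role}: {t.content}" for t in turns)
    return (
        "You are evaluating a full conversation, not a single turn.\n"
        f"Success criteria: {success_criteria}\n"
        f"Transcript:\n{transcript}\n"
        "Answer PASS or FAIL with a one-line reason."
    )

convo = [
    Turn("user", "I'd like to check my balance."),
    Turn("agent", "Sure - can you confirm your date of birth?"),
]
prompt = build_judge_prompt(
    convo, "Agent must verify identity before sharing account data"
)
```

In a real pipeline the returned prompt would be sent to whichever model acts as judge; here it simply shows the session-as-a-unit framing.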

Mock tool platform – Agents call tools, and running simulations against your real APIs is slow and flaky. Our mock tool platform lets you define a tool’s schema, behavior, and return values, so simulations can exercise tool selection and decision-making without touching production systems.
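A rough sketch of what a mock tool might look like, under the assumption that a mock pairs a schema with canned behavior. The registry shape and names (`mock_tools`, `call_tool`, `lookup_order`) are hypothetical, not Cekura’s format:

```python
# Illustrative mock-tool registry: a schema plus a deterministic canned
# response, so simulations can test whether the agent picks the right
# tool with the right arguments, without hitting a real backend.
mock_tools = {
    "lookup_order": {
        "schema": {"order_id": "string"},
        "behavior": lambda args: (
            {"status": "shipped", "eta": "2 days"}
            if args.get("order_id")
            else {"error": "order_id required"}
        ),
    }
}

def call_tool(name, args):
    """Dispatch an agent's tool call to its mock implementation."""
    return mock_tools[name]["behavior"](args)
```

Because the return values are fixed, a simulation run exercises the agent’s tool-selection logic rather than the flakiness of a live API.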

Deterministic, structured test cases – LLMs are stochastic, and a CI test that passes “most of the time” is useless. Instead of free-form prompts, our test cases are defined as structured condition-action trees: explicit conditions that trigger specific responses, with support for fixed messages when word-for-word accuracy matters. The synthetic user behaves consistently across runs – same branches, same inputs – so a failure is a real regression, not noise.
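As a sketch of the condition-action idea (the structure below is an illustrative assumption, not Cekura’s schema): each branch pairs an explicit condition on the agent’s last message with a fixed reply, so every run walks the same path.

```python
# Sketch of a condition-action list for a deterministic synthetic user:
# explicit conditions map to fixed, word-for-word replies, so repeated
# runs take identical branches. Illustrative structure only.
scenario = [
    # (condition on the agent's last message, synthetic user's fixed reply)
    (lambda msg: "date of birth" in msg.lower(), "01/02/1990"),
    (lambda msg: "phone" in msg.lower(), "555-0100"),
    (lambda msg: True, "I'd like to cancel my subscription."),  # default branch
]

def synthetic_user_reply(agent_message):
    """Return the first matching fixed reply - no sampling, no drift."""
    for condition, fixed_reply in scenario:
        if condition(agent_message):
            return fixed_reply
```

Since the synthetic user is deterministic, any change in the conversation outcome between runs must come from the agent under test.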

Cekura also monitors your live agent traffic. The obvious comparison here is a tracing platform like Langfuse or LangSmith – and those are great tools for debugging individual LLM calls. But the failure mode of conversational agents is different: the bug isn’t in any one turn, it’s in how the turns relate to each other. Take a verification flow that requires name, date of birth, and phone number before proceeding – if the agent skips asking for date of birth and moves on anyway, each individual turn looks fine in isolation. The failure is only visible when you evaluate the whole session as a unit. Cekura is built around this from the ground up: where tracing platforms evaluate turn by turn, Cekura evaluates the entire session. Imagine a banking agent where the user fails verification in step 1, but the agent hallucinates and proceeds anyway. A turn-based evaluator looks at step 3 (address verification) and marks it green – the correct question was asked. Cekura’s judge sees the full transcript and flags the session as a failure, because verification never succeeded.
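The banking example above can be sketched as a session-level check (a deterministic stand-in for the LLM judge; the phrase matching and function names are assumptions for illustration): the scan runs over the whole transcript, which is exactly what a per-turn evaluator cannot do.

```python
# Sketch: a session-level check that scans the full transcript and fails
# the session if the agent proceeded without successful verification.
# Each turn in bad_session looks plausible on its own; only the
# whole-session view catches the skip. Illustrative, not Cekura's judge.
def session_passed(turns):
    """turns: list of (role, content) tuples for the full conversation."""
    verified = False
    for role, content in turns:
        text = content.lower()
        if role == "agent" and "verification successful" in text:
            verified = True
        if role == "agent" and "here is your balance" in text and not verified:
            return False  # proceeded without verification: session fails
    return True

bad_session = [
    ("user", "My DOB is 01/02/1990"),
    ("agent", "That doesn't match our records."),   # verification failed
    ("agent", "Here is your balance: $1,200."),     # hallucinated past it
]
```

An LLM judge generalizes the hard-coded phrase checks, but the shape is the same: one verdict per session, not per turn.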

Try us at https://www.cekura.ai – 7-day free trial, no credit card required. Paid plans from $30/month.

We also posted a product video if you want to see it in action: https://www.youtube.com/watch?v=n8FFKv1-nMw. The first minute dives into quick onboarding – and if you want to jump straight to the results, skip to 8:40.

We’d also love to hear from the HN community – how are you testing behavioral regressions in your agents? Which failure mode has burned you the most? Happy to dig in!


