
Researchers at Stanford, Nvidia, and Together AI have developed a new technique that can find novel solutions to very complex problems. For example, they optimized a key GPU kernel to run 2x faster than the previous state of the art written by human experts.
The technique, called test-time training to discover (TTT-Discover), challenges the prevailing paradigm of letting models "think longer" on reasoning problems. TTT-Discover allows the model to continue training during inference, updating its weights for the problem at hand.
The limits of 'frozen' reasoning
Current enterprise AI strategies often depend on "frozen" models. Whether you use closed or open reasoning models, the model's parameters are fixed. When you prompt these models, they search for answers within the bounds of their training data. This works well for problems similar to those the model has seen before.
However, true discovery problems, such as inventing a new algorithm or proving a new mathematical theorem, are by definition out of distribution. If the solution requires a leap of logic not present in the training set, a frozen model is likely to fail, no matter how much compute you spend on it during inference.
In comments to VentureBeat, Mert Yuksekgonul, co-author of the paper and a doctoral student at Stanford, illustrated this difference using a well-known mathematical breakthrough:
"I believe that thinking models will not be able to prove, for example, that P! = NP, without trial-time training, just like Andrew Wiles would not have been able to prove Fermat’s Last Theorem unless he spent 7 years isolating this single problem and continuously learning from his failures."
TTT-Discover treats the test problem not as a question to be answered, but as an environment to be mastered. As the model attempts to solve the problem, it generates different kinds of data: failures, partial successes, and errors. Rather than discarding this data, TTT-Discover uses it to update the model's weights in real time, allowing the model to specialize in that specific challenge instead of relying on a very general problem-solving framework.
A different approach to reinforcement learning
TTT-Discover marks a fundamental change in how reasoning models are trained. In standard reinforcement learning (RL), the goal is a generalist policy that performs well on average across many tasks. In TTT-Discover, the goal is to find the best solution to one very specific problem, and the policy is "a means toward this goal," according to the authors. Once the model discovers the artifact (i.e., optimized code, a proof, or a molecule), the neural network that generated it can be discarded.
To achieve this, the researchers engineered two specific components that differentiate TTT-Discover from standard reinforcement learning:
- Entropic objective: Standard RL optimizes for average expected reward; if a model takes a risky path and fails, it is penalized. TTT-Discover flips this: its "entropic objective" increasingly weights high-reward outcomes. This forces the model to ignore "safe," average answers and aggressively search for "eureka" outliers, solutions that are unlikely to be found but yield huge rewards.
- PUCT search: The system uses PUCT, a tree-search algorithm inspired by AlphaZero. It explores different solution paths, creating a dataset of attempts. The model then trains on this dataset in real time, learning which partial steps lead to higher-reward outcomes.
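Both components above can be made concrete with short formulas. The PUCT score below is the standard AlphaZero-style selection rule; the entropic weighting is an illustrative exponential tilt toward high-reward samples, and the paper's exact formulation may differ.

```python
import math

def puct_score(q_value, prior, visits, parent_visits, c_puct=1.5):
    """AlphaZero-style PUCT: exploit the value estimate Q, while the
    exploration term favors high-prior, rarely visited branches."""
    return q_value + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

def entropic_weights(rewards, temperature=1.0):
    """Tilt training weight toward high-reward outliers instead of the mean.
    Lower temperature concentrates weight on the best attempts."""
    m = max(rewards)  # subtract the max for numerical stability
    tilted = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(tilted)
    return [t / total for t in tilted]
```

Under mean-reward RL every attempt in a batch counts equally; under a weighting like `entropic_weights`, a single "eureka" attempt can dominate the gradient even if most attempts failed.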
Importantly, this method works best on problems with a continuous reward signal. The system needs a way to measure incremental progress, such as "runtime in microseconds" or "error rate," rather than a binary "pass/fail" signal. This allows the model to pursue gradual improvements toward the optimal solution.
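The contrast between the two kinds of signal can be sketched as two reward functions. This is a toy illustration, not the paper's benchmarking harness; real kernel measurement would pin devices, warm up caches, and so on.

```python
import time

def binary_reward(candidate, tests):
    """Pass/fail: offers no gradient of progress for the search to follow."""
    return 1.0 if all(t(candidate) for t in tests) else 0.0

def runtime_reward(candidate, workload, repeats=5):
    """Continuous signal: faster code earns a strictly higher reward,
    so the search can credit incremental improvements."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        candidate(workload)
        best = min(best, time.perf_counter() - start)
    return -best * 1e6  # negative microseconds: higher reward = faster code
```

With `runtime_reward`, shaving even a few microseconds moves the score, which is exactly the kind of gradual signal the search exploits.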
The economics of 'heavy inference'
For enterprises accustomed to paying pennies per API call, TTT-Discover's cost profile requires a change in mindset. In their experiments, the researchers reported that a single search run involved about 50 training steps and thousands of rollouts, costing about $500 per problem.
TTT-Discover is best suited to "static, high-value assets," as opposed to small, recurring problems that can be solved with existing models and approaches.
Consider a cloud-native enterprise running a data pipeline that processes petabytes of information nightly. If that pipeline depends on a specific SQL query or GPU kernel, optimizing that code by just 1% can save hundreds of thousands of dollars in annual compute costs. In this context, spending $500 to find a 50% faster kernel is a modest expense with immediate ROI.
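The arithmetic behind that claim is simple. The annual spend figure below is an assumption for illustration; the roughly $500 search cost comes from the researchers' experiments.

```python
# Back-of-the-envelope ROI for one discovered optimization.
annual_compute_cost = 30_000_000   # assumed: $30M/year pipeline compute spend
improvement = 0.01                 # a 1% reduction from a faster kernel or query
search_cost = 500                  # one TTT-Discover run

annual_savings = annual_compute_cost * improvement  # dollars saved per year
payback_ratio = annual_savings / search_cost        # first-year return multiple
```

Even at a much smaller spend, a single percentage point of savings repays the search cost many times over in the first year.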
"It is most useful for low-frequency, high-impact decisions, where a single improvement is worth far more than the calculation cost," Yuksekgonul said. "Supply chain routing, drug design and ingredients are traceable. In these settings, spending hundreds of dollars on a single discovery phase can easily pay for itself."
Implementation Considerations
One of the most important findings for enterprise adoption is that TTT-Discover does not require a proprietary frontier model. The researchers achieved state-of-the-art results using gpt-oss-120b, OpenAI's open-weight model. They have also released the TTT-Discover code, enabling researchers and developers to use it with their own models.
Because the technique works with an open model, companies can run the "discovery loop" entirely within their own secure VPC or on-premises H100 cluster, without sending proprietary data to third-party servers.
“If a company is already doing reinforcement learning, no additional infrastructure is needed,” Yuksekgonul said. “TTT-Discover uses the same training stack (GPU, rollout worker, optimizer, checkpointing).”
If they don't already run RL, they will need to build that infrastructure. But enterprises can also use existing solutions to reduce the complexity. The researchers ran their training with the Tinker API from Thinking Machines, which manages the complexity of distributed training and inference.
“Tooling like Tinker (and open variants, e.g., OpenTinker) reduces setup costs, and both labor and compute costs are likely to decrease over time,” he said.
Real World Use Cases
Researchers deployed TTT-Discover in four different technical domains: systems engineering, algorithm design, biology, and mathematics. In almost every instance, the method established a new state of the art.
In one experiment, the model optimized GPU kernels for matrix multiplication (including the "trimul" kernel used in AlphaFold), achieving execution speeds up to 2x faster than the previous state of the art and outperforming the best human-written kernels on the leaderboard.
In competitive programming (AtCoder), it solved complex heuristic problems (for example, optimizing geometric constraints for fishing nets) better than top human experts and prior AI baselines.
For the enterprise, turning these academic benchmarks into business value depends on one specific hurdle: the existence of a verifiable scalar signal. Unlike text-generating chatbots, TTT-Discover requires a hard metric to optimize (for example, runtime, error rate, or profit margin).
Yuksekgonul said this requirement draws a clear line between where this technology should and should not be used. "At the moment, the main requirement is a reliable scalar signal of progress – cost, error, molecular properties – which the system can optimize," he said.
This steers enterprise adoption toward "hard" engineering and operations challenges such as logistics, supply chain, and resource management, where problems like fleet routing or crew scheduling often rely on static heuristics. TTT-Discover can treat these as optimization environments, spending hours finding a route structure that cuts daily fuel costs by 5%.
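Treating a routing problem as an optimization environment mostly means exposing a verifiable scalar reward. The sketch below is a hypothetical, minimal version of that idea (the distance matrix and interface names are illustrative), where reward is simply negative route cost.

```python
def route_fuel_cost(route, dist):
    """Total distance of visiting stops in order, starting and ending at depot 0."""
    stops = [0, *route, 0]
    return sum(dist[a][b] for a, b in zip(stops, stops[1:]))

class RoutingEnv:
    """Wraps fleet routing as an environment a search can optimize against."""
    def __init__(self, dist):
        self.dist = dist  # dist[i][j] = cost of driving stop i -> stop j

    def reward(self, route):
        # Higher reward = cheaper route; any candidate can be scored,
        # giving the continuous signal the method needs.
        return -route_fuel_cost(route, self.dist)
```

Every candidate route, good or bad, gets a comparable score, which is what lets the search credit a 5% cheaper route over a 4% cheaper one.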
The need for clear verifiers rules out qualitative tasks such as "write a better marketing strategy," where validation is subjective and prone to noise.
"Difficult to verify problems are still an open question,” Yuksekgonul said.
With current technology, the best way forward is to try to design validators, but "making those validators robust and hard to game is challenging, and we don't have a good solution yet," he added.
From inference to invention
The broader implication is that enterprise AI stacks may need to evolve to support this type of per-problem learning.
“Systems built around a frozen model will need to support per-problem (or per-domain) customization, and enterprises will need better problem specifications and internal feedback signals to make test-time learning effective,” Yuksekgonul said. “If training runs inside a private VPC, the training loop can also be integrated with the company’s internal environment in addition to just the central lab pipeline.”
For the enterprise, the value lies in recognizing "million-dollar problems": optimization challenges where a verifiable metric exists but human progress has stalled. These are the candidates for TTT-Discover. By accepting higher latency and cost for specific queries, enterprises can turn their inference compute into an automated R&D laboratory, finding solutions that were previously out of reach for both humans and frozen AI models.