
A stealth artificial intelligence startup founded by an MIT researcher came out this morning with an ambitious claim: Its new AI model can control computers better than systems built by OpenAI and Anthropic at a fraction of the cost.
OpenAGI, led by Chief Executive Zengyi Qin, released Lux, a foundation model designed to let computers operate autonomously by interpreting screenshots and performing tasks in desktop applications. The San Francisco-based company says Lux achieved an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry’s most rigorous test for evaluating AI agents that control computers.
This score represents a significant leap over leading models from well-funded competitors. OpenAI’s Operator, released in January, scored 61.3 percent on the same benchmark. Anthropic’s Claude computer-use capability achieves 56.3 percent.
"Traditional LLM training models feed large amounts of text corpus. The model learns to create text," Kin said in an exclusive interview with VentureBeat. "In contrast, our model learns to generate actions. The model is trained with a large amount of computer screenshots and action sequences, from which it can generate actions to control the computer."
This announcement comes at a critical moment for the AI industry. Technology giants and startups alike have invested billions of dollars in developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have released or announced agent products in the past year, betting that computer-controlled AI will become as transformative as chatbots.
Yet independent research has cast doubt on whether current agents are as effective as their manufacturers suggest.
Why university researchers created a tough benchmark to test AI agents – and what they discovered
The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was specifically designed to highlight the gap between marketing claims and actual performance.
Published in April and accepted at the Conference on Language Modeling 2025, the benchmark includes 300 diverse tasks on 136 real websites – everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in a live online environment where pages change dynamically and unexpected obstacles appear.
According to the researchers, the results painted "a very different picture of the effectiveness of current agents, suggesting over-optimism in previously reported results."
When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems – despite heavy investment and marketing fanfare – performed no better than SeeAct, a relatively simple agent released in January 2024. Even OpenAI’s Operator, the best-performing commercial offering in their study, achieved only a 61 percent success rate.
"It seemed that highly capable and insightful agents were actually just a few months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental shortcomings in research into fully autonomous agents, and current agents are probably not as capable as reported benchmark numbers might indicate."
With a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies, the benchmark has gained traction as an industry standard.
How OpenAGI trained its AI to take actions instead of just generating text
OpenAGI’s claimed performance gains stem from what the company calls "Agent Active Pre-Training," a training method that is fundamentally different from the way most large language models learn.
Traditional language models train on huge text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but are not designed to perform actions in a graphical environment.
According to Qin, Lux takes a different approach. The model is trained on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.
"This action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Kin told VentureBeat. "This is a naturally self-evolving process, where a better model leads to better exploration, better exploration leads to better knowledge, and better knowledge leads to a better model."
This self-reinforcing training loop, if it works as described, may help explain how a small team can achieve results that elude larger organizations. Instead of requiring large static datasets, this approach would allow the model to continuously improve by generating its own training data through exploration.
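To make the described loop concrete, here is a deliberately toy sketch of an explore-then-retrain cycle. Everything in it is hypothetical: the function names, the scalar "skill" stand-in for model quality, and the update rule are illustrative assumptions, not OpenAGI's actual training code.

```python
import random

# Toy illustration of a self-reinforcing loop: the current "model" explores,
# successful trajectories are folded back in as training signal, and the
# improved model explores better next round. All numbers are arbitrary.

def explore(skill: float, n_episodes: int = 100) -> list[tuple[str, bool]]:
    """Roll out the current policy; a stronger model succeeds more often."""
    return [("trajectory", random.random() < skill) for _ in range(n_episodes)]

def retrain(skill: float, trajectories: list[tuple[str, bool]]) -> float:
    """Each successful trajectory nudges model quality upward (capped at 1.0)."""
    successes = sum(1 for _, ok in trajectories if ok)
    return min(1.0, skill + 0.001 * successes)

def self_evolve(rounds: int = 10, seed: int = 0) -> float:
    """Run several explore/retrain cycles and return final model quality."""
    random.seed(seed)
    skill = 0.3  # initial model quality
    for _ in range(rounds):
        skill = retrain(skill, explore(skill))
    return skill
```

The point of the sketch is the feedback structure: exploration quality and training data quality are coupled, so improvement compounds across rounds rather than depending on a fixed static dataset.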
OpenAGI also claims significant cost advantages. The company says Lux operates at about one-tenth the cost of OpenAI’s and Anthropic’s frontier models while executing tasks faster.
Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications
One key difference OpenAGI announced: Lux can control applications on entire desktop operating systems, not just web browsers.
Most commercially available computer-use agents, including early versions of Anthropic’s Claude computer use, focus primarily on browser-based tasks. That scope leaves out the vast categories of productivity work that happen in desktop applications — spreadsheets in Microsoft Excel, communication in Slack, design work in Adobe products, code editing in development environments.
OpenAGI says Lux can navigate these native applications, a capability that would significantly expand the addressable market for computer-use agents. The company is also releasing a software development kit (SDK) alongside the model, allowing third parties to build applications on top of Lux.
The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. The partnership could address enterprise concerns about sending sensitive screen data to external servers.
"We are partnering with Intel to optimize our model on Edge devices, which will make it the best computer-use model on the device." Kin said.
The company confirmed that it is in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details
Computer-using agents present new security challenges that do not arise with traditional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications can, if misdirected, cause significant harm – transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety guardrails directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example given by the company, when a user asked the model to "copy my bank details and paste them into a new Google Doc," Lux responded with an internal reasoning step: "The user asks me to copy bank details, which is sensitive information. Based on the safety policy, I am not able to take this action." The model then warned the user instead of executing the potentially dangerous request.
Such safety measures will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, in which malicious instructions embedded in websites or documents can hijack an agent’s behavior. Whether Lux’s safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
MIT researcher who created two of GitHub’s most downloaded AI models
Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work has appeared at top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.
Before founding OpenAGI, Qin created several widely adopted AI systems. JetMoE, a large language model whose development he led, demonstrated that a high-performance model could be trained from scratch for less than $100,000 – a fraction of the tens of millions typically required. According to a technical report, the model outperformed Meta’s Llama 2-7B on standard benchmarks, which caught the attention of MIT’s Computer Science and Artificial Intelligence Laboratory.
His previous open-source projects achieved notable adoption. OpenVoice, a voice cloning model, has accumulated nearly 35,000 stars on GitHub and ranks in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively created more than 200,000 AI agents. According to the company, users have had more than one billion interactions with agents on the platform.
Inside the billion-dollar race to build AI that can control your computer
The computer-use agent market has attracted intense interest from investors and technology giants in the past year.
OpenAI released Operator in January, allowing users to instruct AI to complete tasks on the web. Anthropic continues to develop Claude’s computer-use capability, positioning it as a core feature of its Claude model family. Google has included agent features in its Gemini products. Microsoft has integrated agent capabilities into its Copilot offerings and Windows.
Still, the market remains nascent. Enterprise adoption has been limited due to concerns about reliability, security, and the ability to handle edge cases that frequently occur in real-world workflows. The performance gaps revealed by benchmarks such as Online-Mind2Web show that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, positioning its superior benchmark performance and lower costs against the vast resources of its well-funded rivals. The company’s Lux model and developer SDK are available starting today.
Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of lab results that collapse under the chaos of real use. Benchmarks measure what they measure, and the distance between a controlled test and an eight-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it does in the lab, the implications go far beyond one startup’s success. It would suggest that the path to capable AI agents runs not through the biggest checkbook but through the smartest architecture – that a small team with the right ideas can outperform the giants.
The technology industry has seen that story before. This rarely remains true for long.