
Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking research released Thursday by Upwork, the largest online work marketplace.
But the same study reveals a more promising path forward: When AI agents collaborate with human experts, project completion rates surge by up to 70%, suggesting the future of work may not pit humans against machines but rather pair them together in powerful new ways.
The findings, drawn from more than 300 real client projects posted to Upwork's platform, mark the first systematic evaluation of how human expertise amplifies AI agent performance in actual professional work — not synthetic tests or academic simulations. The research challenges both the hype around fully autonomous AI agents and fears that such technology will imminently replace knowledge workers.
"AI agents aren't that agentic, meaning they aren't that good," Andrew Rabinovich, Upwork's chief technology officer and head of AI and machine learning, said in an exclusive interview with VentureBeat. "However, when paired with expert human professionals, project completion rates improve dramatically, supporting our firm belief that the future of work will be defined by humans and AI collaborating to get more work done, with human intuition and domain expertise playing a critical role."
How AI agents performed on 300+ real freelance jobs—and why they struggled
Upwork's Human+Agent Productivity Index (HAPI) evaluated how three leading AI systems — Google's Gemini 2.5 Pro, OpenAI's GPT-5, and Anthropic's Claude Sonnet 4 — performed on actual jobs posted by paying clients across categories including writing, data science, web development, engineering, sales, and translation.
Critically, Upwork deliberately selected simple, well-defined projects where AI agents stood a reasonable chance of success. These jobs, priced under $500, represent less than 6% of Upwork's total gross services volume — a tiny fraction of the platform's overall business and an acknowledgment of current AI limitations.
"The reality is that although we study AI, and I've been doing this for 25 years, and we see significant breakthroughs, the reality is that these agents aren't that agentic," Rabinovich told VentureBeat. "So if we go up the value chain, the problems become so much more difficult, then we don't think they can solve them at all, even to scratch the surface. So we specifically chose simpler tasks that would give an agent some kind of traction."
Even on these deliberately simplified tasks, AI agents working independently struggled. But when expert freelancers provided feedback — spending an average of just 20 minutes per review cycle — the agents' performance improved substantially with each iteration.
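The workflow behind those improvements is a simple loop: the agent drafts, an expert reviews for roughly 20 minutes, and the agent revises against the feedback. The sketch below illustrates that loop in Python; the function names, data shapes, and toy acceptance rule are assumptions made for illustration, not Upwork's actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    draft: str
    feedback: str | None = None

def run_agent(job_description: str, history: list[Attempt]) -> str:
    """Stand-in for an LLM agent call; a real harness would hit a model API."""
    prior = history[-1].feedback if history else "none"
    return f"Draft for '{job_description}' (addressing feedback: {prior})"

def expert_review(draft: str) -> tuple[bool, str]:
    """Stand-in for a ~20-minute freelancer review: returns (done, feedback)."""
    done = "none" not in draft  # toy acceptance rule for the sketch only
    return done, "Tighten the summary and match the client's requested format."

def collaborate(job_description: str, max_cycles: int = 3) -> list[Attempt]:
    history: list[Attempt] = []
    for _ in range(max_cycles):
        draft = run_agent(job_description, history)
        done, feedback = expert_review(draft)
        history.append(Attempt(draft=draft, feedback=feedback))
        if done:  # in the study, completion rates rose with each review cycle
            break
    return history

for i, attempt in enumerate(collaborate("Summarize Q3 sales data"), start=1):
    print(f"Cycle {i}: {attempt.draft}")
```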
20 minutes of human feedback boosted AI completion rates up to 70%
The research reveals stark differences in how AI agents perform with and without human guidance across different types of work. For data science and analytics projects, Claude Sonnet 4 achieved a 64% completion rate working alone but jumped to 93% after receiving feedback from a human expert. In sales and marketing work, Gemini 2.5 Pro's completion rate rose from 17% independently to 31% with human input. OpenAI's GPT-5 showed similarly dramatic improvements in engineering and architecture tasks, climbing from 30% to 50% completion.
The pattern held across virtually all categories, with agents responding particularly well to human feedback on qualitative, creative work requiring editorial judgment — areas like writing, translation, and marketing — where completion rates increased by up to 17 percentage points per feedback cycle.
The finding challenges a fundamental assumption in the AI industry: that agent benchmarks conducted in isolation accurately predict real-world performance.
"While we show that in the tasks that we have selected for agents to perform in isolation, they perform similarly to the previous results that we've seen published openly, what we've shown is that in collaboration with humans, the performance of these agents improves surprisingly well," Rabinovich said. "It's not just a one-turn back and forth, but the more feedback the human provides, the better the agent gets at performing."
Why ChatGPT can ace the SAT but can't count the R's in 'strawberry'
The research arrives as the AI industry grapples with a measurement crisis. Traditional benchmarks — standardized tests that AI models can master, sometimes scoring perfectly on SAT exams or mathematics olympiads — have proven poor predictors of real-world capability.
"With advances of large language models, what we're now seeing is that these static, academic datasets are completely saturated," Rabinovich said. "So you could get a perfect score in the SAT test or LSAT or any of the math olympiads, and then you would ask ChatGPT how many R's there are in the word strawberry, and it would get it wrong."
This phenomenon — where AI systems ace formal tests but stumble on trivial real-world questions — has led to growing skepticism about AI capabilities, even as companies race to deploy autonomous agents. Several recent benchmarks from other firms have tested AI agents on Upwork jobs, but those evaluations measured only isolated performance, not the collaborative potential that Upwork's research reveals.
"We wanted to evaluate the quality of these agents on actual real work with economic value associated with it, and not only see how well these agents do, but also see how these agents do in collaboration with humans, because we sort of knew already that in isolation, they're not that advanced," Rabinovich explained.
For Upwork, which connects roughly 800,000 active clients posting more than 3 million jobs annually to a global pool of freelancers, the research serves a strategic business purpose: establishing quality standards for AI agents before allowing them to compete or collaborate with human workers on its platform.
The economics of human-AI teamwork: Why paying for expert feedback still saves money
Despite requiring multiple rounds of human feedback — each lasting about 20 minutes — the time investment remains "orders of magnitude different between a human doing the work alone, versus a human doing the work with an AI agent," Rabinovich said. Where a project might take a freelancer days to complete independently, the agent-plus-human approach can deliver results in hours through iterative cycles of automated work and expert refinement.
The economic implications extend beyond simple time savings. Upwork recently reported that gross services volume from AI-related work grew 53% year-over-year in the third quarter of 2025, one of the strongest growth drivers for the company. But executives have been careful to frame AI not as a replacement for freelancers but as an enhancement to their capabilities.
"AI was a huge overhang for our valuation," Erica Gessert, Upwork's CFO, told CFO Brew in October. "There was this belief that all work was going to go away. AI was going to take it, and especially work that's done by people like freelancers, because they are impermanent. Actually, the opposite is true."
The company's strategy centers on enabling freelancers to handle more complex, higher-value work by offloading routine tasks to AI. "Freelancers actually prefer to have tools that automate the manual labor and repetitive part of their work, and really focus on the creative and conceptual part of the process," Rabinovich said.
Rather than replacing jobs, he argues, AI will transform them: "Simpler tasks will be automated by agents, but the jobs will become much more complex in the number of tasks, so the amount of work and therefore earnings for freelancers will actually only go up."
AI coding agents excel, but creative writing and translation still need humans
The research reveals a clear pattern in agent capabilities. AI systems perform best on "deterministic and verifiable" tasks with objectively correct answers, like solving math problems or writing basic code. "Most coding tasks are very similar to each other," Rabinovich noted. "That's why coding agents are becoming so good."
In Upwork's tests, web development, mobile app development, and data science projects — especially those involving structured, computational work — saw the highest standalone agent completion rates. Claude Sonnet 4 completed 68% of web development jobs and 64% of data science projects without human help, while Gemini 2.5 Pro achieved 74% on certain technical tasks.
But qualitative work proved far more challenging. When asked to create website layouts, write marketing copy, or translate content with appropriate cultural nuance, agents floundered without expert guidance. "When you ask it to write you a poem, the quality of the poem is extremely subjective," Rabinovich said. "Since the rubrics for evaluation were provided by humans, there's some level of variability in representation."
Writing, translation, and sales and marketing projects showed the most dramatic improvements from human feedback. For writing work, completion rates increased by up to 17 percentage points after expert review. Engineering and architecture projects requiring creative problem-solving — like civil engineering or architectural design — improved by as much as 23 percentage points with human oversight.
This pattern suggests AI agents excel at pattern matching and replication but struggle with creativity, judgment, and context — precisely the skills that define higher-value professional work.
Inside the research: How Upwork tested AI agents with peer-reviewed scientific methods
Upwork partnered with elite freelancers on its platform to evaluate every deliverable produced by AI agents, both independently and after each cycle of human feedback. These evaluators created detailed rubrics defining whether projects met core requirements specified in job descriptions, then scored outputs across multiple iterations.
Importantly, evaluators focused only on objective completion criteria, excluding subjective factors like stylistic preferences or quality judgments that might emerge in actual client relationships. "Rubric-based completion rates should not be viewed as a measure of whether an agent would be paid in a real marketplace setting," the research notes, "but as an indicator of its ability to fulfill explicitly defined requests."
This distinction matters: An AI agent might technically complete all specified requirements yet still produce work a client rejects as inadequate. Subjective client satisfaction — the true measure of marketplace success — remains outside what a rubric-based benchmark can capture.
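In concrete terms, rubric-based scoring reduces to checking a deliverable against a list of explicit, objective requirements. The sketch below shows the idea; the rubric items and the scoring rule are invented for illustration and may differ from the paper's exact scheme.

```python
# Illustrative rubric for a hypothetical data-analytics job posting.
rubric = [
    "Deliverable includes the three charts requested in the job post",
    "Analysis uses the client-provided CSV, not fabricated data",
    "Written summary stays under 500 words",
]

def completion_score(checks: dict[str, bool]) -> float:
    """Fraction of explicitly defined requirements the deliverable satisfies."""
    return sum(checks[item] for item in rubric) / len(rubric)

# Evaluators mark each objective criterion pass/fail; subjective style
# preferences are deliberately excluded, per the study's methodology.
first_pass = {rubric[0]: True, rubric[1]: True, rubric[2]: False}
after_feedback = {item: True for item in rubric}

print(completion_score(first_pass))      # 2 of 3 requirements met
print(completion_score(after_feedback))  # 1.0: all requirements met
```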
The research underwent double-blind peer review and was accepted to NeurIPS, the premier academic conference for AI research, where Upwork will present full results in early December. The company plans to publish a complete methodology and make the benchmark available to the research community, updating the task pool regularly to prevent overfitting as agents improve.
"The idea is for this benchmark to be a living and breathing platform where agents can come in and evaluate themselves on all categories of work, and the tasks that will be offered on the platform will always update, so that these agents don't overfit and basically memorize the tasks at hand," Rabinovich said.
Upwork's AI strategy: Building Uma, a 'meta-agent' that manages human and AI workers
The research directly informs Upwork's product roadmap as the company positions itself for what executives call "the age of AI and beyond." Rather than building its own AI agents to complete specific tasks, Upwork is developing Uma, a "meta orchestration agent" that coordinates between human workers, AI systems, and clients.
"Today, Upwork is a marketplace where clients look for freelancers to get work done, and then talent comes to Upwork to find work," Rabinovich explained. "This is getting expanded into a domain where clients come to Upwork, communicate with Uma, this meta-orchestration agent, and then Uma identifies the necessary talent to get the job done, gets the tasks outcomes completed, and then delivers that to the client."
In this vision, clients would interact primarily with Uma rather than directly hiring freelancers. The AI system would analyze project requirements, determine which tasks require human expertise versus AI execution, coordinate the workflow, and ensure quality — acting as an intelligent project manager rather than a replacement worker.
"We don't want to build agents that actually complete the tasks, but we are building this meta orchestration agent that figures out what human and agent talent is necessary in order to complete the tasks," Rabinovich said. "Uma evaluates the work to be delivered to the client, orchestrates the interaction between humans and agents, and is able to learn from all the interactions that happen on the platform how to break jobs into tasks so that they get completed in a timely and effective manner."
The company recently announced plans to open its first international office in Lisbon, Portugal, by the fourth quarter of 2026, with a focus on AI infrastructure development and technical hiring. The expansion follows Upwork's record-breaking third quarter, driven partly by AI-powered product innovation and strong demand for workers with AI skills.
OpenAI, Anthropic, and Google race to build autonomous agents—but reality lags hype
Upwork's findings arrive amid escalating competition in the AI agent space. OpenAI, Anthropic, Google, and numerous startups are racing to develop autonomous agents capable of complex multi-step tasks, from booking travel to analyzing financial data to writing software.
But recent high-profile stumbles have tempered initial enthusiasm. AI agents frequently misunderstand instructions, make logical errors, or produce confidently wrong results, the last of which researchers call "hallucination." The gap between controlled demonstration videos and reliable real-world performance remains vast.
"There have been some evaluations that came from OpenAI and other platforms where real Upwork tasks were considered for completion by agents, and across the board, the reported results were not very optimistic, in the sense that they showed that agents—even the best ones, meaning powered by most advanced LLMs — can't really compete with humans that well, because the completion rates are pretty low," Rabinovich said.
Rather than waiting for AI to fully mature — a timeline that remains uncertain—Upwork is betting on a hybrid approach that leverages AI's strengths (speed, scalability, pattern recognition) while retaining human strengths (judgment, creativity, contextual understanding).
This philosophy extends to learning and improvement. Current AI models train primarily on static datasets scraped from the internet, supplemented by human preference feedback. But most professional work is qualitative, making it difficult for AI systems to know whether their outputs are actually good without expert evaluation.
"Unless you have this collaboration between the human and the machine, where the human is kind of the teacher and the machine is the student trying to discover new solutions, none of this will be possible," Rabinovich said. "Upwork is very uniquely positioned to create such an environment because if you try to do this with, say, self-driving cars, and you tell Waymo cars to explore new ways of getting to the airport, like avoiding traffic signs, then a bunch of bad things will happen. In doing work on Upwork, if it creates a wrong website, it doesn't cost very much, and there's no negative side effects. But the opportunity to learn is absolutely tremendous."
Will AI take your job? The evidence suggests a more complicated answer
While much public discourse around AI focuses on job displacement, Rabinovich argues the historical pattern suggests otherwise — though the transition may prove disruptive.
"The narrative in the public is that AI is eliminating jobs, whether it's writing, translation, coding or other digital work, but no one really talks about the exponential amount of new types of work that it will create," he said. "When we invented electricity and steam engines and things like that, they certainly replaced certain jobs, but the amount of new jobs that were introduced is exponentially more, and we think the same is going to happen here."
The research identifies emerging job categories focused on AI oversight: designing effective human-machine workflows, providing high-quality feedback to improve agent performance, and verifying that AI-generated work meets quality standards. These skills — prompt engineering, agent supervision, output verification — barely existed two years ago but now command premium rates on platforms like Upwork.
"New types of skills from humans are becoming necessary in the form of how to design the interaction between humans and machines, how to guide agents to make them better, and ultimately, how to verify that whatever agentic proposals are being made are actually correct, because that's what's necessary in order to advance the state of AI," Rabinovich said.
The question remains whether this transition — from doing tasks to overseeing them — will create opportunities as quickly as it disrupts existing roles. For freelancers on Upwork, the answer may already be emerging in their bank accounts: The platform saw AI-related work grow 53% year-over-year, even as fears of AI-driven unemployment dominated headlines.
