
AI that can see and understand what is happening in video – especially live feeds – is an attractive product for many enterprises and organizations. Beyond providing security surveillance on sites and facilities, such AI models can also be used to clip the most exciting parts of marketing videos and repurpose them for social use, identify inconsistencies and flaws in videos and flag them for removal, and analyze the body language and actions of participants in controlled studies or candidates applying for new roles.
Although there are some AI models today that provide this type of functionality, it is far from a mainstream capability. However, two-year-old startup Perceptron Inc. wants to change that. Today, it announced the release of its flagship proprietary video analytics reasoning model, MK1 (short for "mark one"), at a cost – $0.15 per million input tokens / $1.50 per million output tokens via its application programming interface (API) – that is about 80-90% lower than major proprietary rivals, namely Anthropic's Claude Sonnet 4.5, OpenAI's GPT-5, and Google's Gemini 3.1 Pro.
Led by co-founder and CEO Armen Aghajanyan, formerly of Meta FAIR and Microsoft, the company spent 16 months developing a "multimodal recipe" to overcome the complexities of the physical world from the ground up.
This launch signals a new era where models are expected to understand cause-and-effect, object dynamics, and the laws of physics with the same fluency they once applied to grammar.
Interested users and potential enterprise customers can try it out for themselves on Perceptron's public demo site.
Performance in spatial and video benchmarks
The model's performance is supported by a suite of industry-standard benchmarks focused on grounded understanding.
In spatial reasoning, MK1 achieved a score of 85.1 on EmbSpatialBench, beating out Google's Gemini Robotics-ER 1.5 (78.4) and Alibaba's Qwen 3.5-27B (about 84.5).
On RefSpatialBench, MK1's score of 72.4 represents a huge leap over competitors like GPT-5 (9.0) and Claude Sonnet 4.5 (2.2), highlighting a significant advantage in referring expression understanding.
Video benchmarks show similar dominance: on the EgoSchema "hard subset" – where guessing from the first and last frames alone is insufficient – MK1 scored 41.4, matching Alibaba's Qwen 3.5-27B and significantly outpacing Gemini 3.1 Flash-Lite (25.0).
On VSI-Bench, MK1 reached 88.5, the highest recorded score among the compared models, further validating its ability to handle real-world temporal reasoning tasks.
Market conditions and efficiency frontier
Perceptron has clearly targeted the "efficiency frontier," a measure that plots scores on video and embodied reasoning benchmarks against blended cost per million tokens.
Benchmarking data shows that MK1 occupies a unique position: it matches or exceeds the performance of "frontier" models like GPT-5 and Gemini 3.1 Pro while maintaining a cost profile closer to their "lite" or "flash" versions.
Specifically, Perceptron MK1 is priced at $0.15 per million input tokens and $1.50 per million output tokens. By comparison, the "efficiency frontier" chart shows GPT-5 at a significantly higher blended cost (closer to $2.00) and Gemini 3.1 Pro at around $3.00, while MK1 sits at the $0.30 blended cost mark with a better reasoning score.
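As a sanity check on that $0.30 figure: a blended rate depends on the assumed mix of input and output tokens, which Perceptron does not specify in the chart. The short sketch below shows how the published per-token prices combine; the roughly 8:1 input-to-output ratio is an illustrative assumption, not a figure from Perceptron.

```python
# Blended cost per million tokens from the published MK1 API prices.
# The input:output ratios tried below are assumptions for illustration,
# not usage mixes published by Perceptron.

INPUT_PRICE = 0.15   # $ per million input tokens (per announcement)
OUTPUT_PRICE = 1.50  # $ per million output tokens (per announcement)

def blended_cost(input_share: float) -> float:
    """Weighted-average price per million tokens for a given input share."""
    return input_share * INPUT_PRICE + (1 - input_share) * OUTPUT_PRICE

# Video workloads are input-heavy, since frames dominate the token count.
for share in (0.5, 0.8, 8 / 9):
    print(f"input share {share:.2f} -> ${blended_cost(share):.2f}/M tokens")
# A roughly 8:1 input:output mix reproduces the ~$0.30 blended figure.
```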
This aggressive pricing strategy is aimed at making high-level physical AI accessible for large-scale industrial use rather than just experimental research.
Architecture and temporal continuity
The technical core of Perceptron MK1 is its ability to process native video at up to 2 frames per second (fps) within a 32K-token context window.
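To put that window in perspective, here is a rough back-of-envelope capacity estimate; the per-frame token cost is an assumption for illustration, since Perceptron has not published one in this announcement.

```python
# Rough capacity estimate for a 32K-token context at 2 fps.
# TOKENS_PER_FRAME is an illustrative assumption, not a published figure.
CONTEXT_TOKENS = 32_000
FPS = 2
TOKENS_PER_FRAME = 100  # assumed; actual frame tokenization may differ

frames = CONTEXT_TOKENS // TOKENS_PER_FRAME  # ~320 frames at this rate
seconds = frames / FPS                       # ~160 seconds of footage
print(f"~{frames} frames, or about {seconds:.0f}s of video per window")
```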
Unlike traditional vision-language models (VLMs), which often treat video as a disjointed sequence of still images, MK1 is designed for temporal continuity.
This architecture allows the model to "watch" extended streams while maintaining object identity through occlusions – a critical requirement for robotics and surveillance applications.
Developers can query the model for specific moments in a longer stream and receive structured timecodes in return, streamlining video clipping and event detection.
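Perceptron has not published its exact request and response schema in this announcement, so the following is a minimal sketch of what such a timecode query might look like, assuming a hypothetical HTTP endpoint, payload fields, and response shape; the real API will differ.

```python
# Hypothetical sketch: asking MK1 for timestamped events in a stream.
# The endpoint URL, request fields, and response structure below are
# assumptions for illustration, not Perceptron's documented API.
import requests

resp = requests.post(
    "https://api.perceptron.example/v1/analyze",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mk1",
        "video_url": "https://example.com/warehouse-feed.mp4",
        "prompt": "Return the timecode of every forklift entering frame.",
    },
    timeout=120,
)
resp.raise_for_status()
for event in resp.json().get("events", []):  # assumed response field
    print(event["timecode"], event["description"])
```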
Reasoning with the laws of physics
A primary differentiator for MK1 is its "physical reasoning" capability, which Perceptron defines as high-precision spatial awareness that allows the model to understand object dynamics and physical interactions in real-world settings.
For example, the model can analyze a scene to determine whether a basketball shot was taken before or after the buzzer by jointly reasoning about the position of the ball in the air and the readout on the shot clock.
This requires more than pattern recognition; it requires an understanding of how objects move through space and time.
The model is capable of "pixel-perfect" pointing and of counting objects by the hundreds in dense, complex scenes. It can also read analog gauges and clocks, which have historically been difficult for fully digital vision systems to interpret with high reliability.
It also appears to have strong general world and historical knowledge. In my brief testing, I uploaded an old public domain film of skyscraper construction in New York City in 1906 from the US Library of Congress, and MK1 was not only able to correctly describe the content of the footage – including strange, unusual scenes of workers suspended by ropes – but also quickly and correctly identified the approximate date (early 1900s) from the look of the footage alone.
A developer platform for physical AI
Accompanying the model release is an expanded developer platform designed to transform these high-level perception capabilities into functional applications with minimal code.
The Perceptron SDK, available through Python, introduces a number of unique functions, such as "Focus," "Counting," and "in-context learning."
The Focus feature allows users to automatically zoom and crop to specific areas of the frame based on natural language prompts, such as detecting and localizing personal protective equipment (PPE) on a construction site. The Counting function is optimized for dense scenes, such as identifying and pinpointing each puppy in a group or each individual product item.
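Since this article does not reproduce the SDK's actual call signatures, here is a minimal sketch of what using Focus and Counting might look like; the "perceptron" package name, the Client class, and every method and field below are assumptions inferred from the feature names, not documented SDK calls.

```python
# Hypothetical sketch of the Focus and Counting features described above.
# The import path, Client class, and method signatures are assumptions,
# not Perceptron's documented SDK.
from perceptron import Client  # assumed import path

client = Client(api_key="YOUR_API_KEY")

# Focus: crop/zoom to regions matching a natural-language prompt.
regions = client.focus(
    image="site-camera.jpg",
    prompt="workers not wearing hard hats",
)
for r in regions:
    print(r.bounding_box, r.confidence)

# Counting: enumerate and localize each instance in a dense scene.
result = client.count(image="pallet.jpg", prompt="individual boxes")
print(f"{len(result.points)} boxes detected")
```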
Additionally, the platform supports in-context learning, allowing developers to adapt MK1 to specific tasks by providing only a few examples – such as showing an image of an apple and instructing the model to label every instance of "category 1" in a new scene.
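In code, that few-shot adaptation might look something like the sketch below; as with the previous example, the package, method names, and fields are illustrative assumptions rather than documented calls.

```python
# Hypothetical sketch of in-context adaptation: pass a few labeled
# examples alongside the target image. The "perceptron" package and
# these signatures are assumptions, not documented SDK calls.
from perceptron import Client  # assumed import path

client = Client(api_key="YOUR_API_KEY")
labels = client.label(
    image="orchard.jpg",
    examples=[{"image": "apple-reference.jpg", "label": "category 1"}],
    prompt="Label every instance of category 1 in this new scene.",
)
for item in labels:
    print(item.label, item.bounding_box)
```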
Licensing strategy and the Isaac series
Perceptron is adopting a dual-track strategy for its model weights and licensing. The flagship Perceptron MK1 is a closed-source model accessed via API, designed for enterprise-grade performance and security.
However, the company is also maintaining its "Isaac" series, which began with the launch of Isaac 0.1 in September 2025 as an open-weight option. Isaac 0.2-2b-preview, released in December 2025, is a 2-billion-parameter vision-language model with reasoning capabilities that is available for edge and low-latency deployments.
While the weights for the Isaac models are open on the popular AI code sharing community Hugging Face, Perceptron offers commercial licenses for companies that need maximum control of the weights or on-premises deployment.
This approach allows the company to support both the open-source community and specialized industrial partners who require proprietary flexibility. The documentation states that the Isaac 0.2 models are specifically optimized for sub-200ms time-to-first-token, making them ideal for real-time edge devices.
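Because the Isaac weights are distributed on Hugging Face, a standard transformers loading pattern should apply; the repository id below is a guess based on the model's name, so check Perceptron's actual Hugging Face organization page before use.

```python
# Loading the open-weight Isaac model via Hugging Face transformers.
# The repo id below is a guess from the model's name; verify the real
# id on Perceptron's Hugging Face organization page before running.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "PerceptronAI/Isaac-0.2-2B-preview"  # assumed repo id
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto",  # place the 2B model on available GPU/CPU
)
```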
Background on Perceptron's founding and focus
Perceptron AI is a Bellevue, Washington-based physical AI startup, founded by Aghajanyan and Akshat Srivastava, former research scientists at Meta’s Facebook AI Research (FAIR) Lab.
The company’s public materials list its founding date as November 2024, while Washington corporate filing records for Perceptron.ai Inc. show a prior foreign registration filing dated October 9, 2024, which lists Srivastava and Aghajanyan as governors.
In its founding launch post in late 2024, Aghajanyan said he had left Meta after about six years and “joined forces” with Srivastava to build AI for the physical world, while Srivastava said the company grew out of his work on efficiency, multimodality, and new model architectures.
The company's founding appears to stem directly from the pair’s work on multimodal foundation models at Meta. In May 2024, Meta researchers published Chameleon, a family of early-fusion models designed to understand and generate mixed sequences of text and images, which Perceptron later described as part of the lineage behind its own models.
A follow-on paper from July 2024, MoMa, explored more efficient early-fusion training for mixed-modality models and listed both Srivastava and Aghajanyan among the authors. Perceptron’s announced thesis extends that research direction to “physical AI”: models that can process real-world video and other sensory streams for use cases such as robotics, manufacturing, geospatial analysis, security, and content moderation.
Partner ecosystem and future outlook
The real-world impact of MK1 is already being demonstrated through Perceptron’s partner network. Early adopters are using the model for a variety of applications, such as auto-clipping highlights from live sports, which leverages the model’s temporal understanding to identify key plays without human intervention.
In the robotics area, partners are curating teleoperation episodes into training data, effectively automating the process of labeling and cleaning data for robotic arms and mobile units.
Other use cases include multimodal quality control agents on manufacturing lines, which can detect defects and verify assembly steps in real time, and wearable assistants on smart glasses that provide context-aware assistance to users.
Aghajanyan said the releases are the culmination of research aimed at making AI work in the physical world, moving toward a future where "physical AI" is as ubiquitous as digital AI.