Cohere open-sources a coding agent that runs on a single H100

Engineering teams building agentic coding pipelines now have a solid open-source alternative to managed models like Claude Fable 5, and it runs on a single H100. The tradeoff: Cohere’s North Mini Code, launched Tuesday, produced three times the output tokens of comparable models in independent testing, a verbosity cost that compounds in high-volume production workloads.

The new open-source model is a 30 billion parameter mixture-of-experts (MoE) model with 3 billion parameters activated per token, built for agentic software engineering, including sub-agent orchestration, architecture mapping, code review, and terminal work. The model supports a 256,000 token context window with a maximum generation length of 64,000 tokens, and is available on Hugging Face under the Apache 2.0 license.

What can North Mini Code do?

North Mini Code targets the full agentic coding stack. Here’s what the model does and what it runs on.

Agentic software engineering. Cohere built North Mini Code specifically for agentic software engineering rather than adapting it from a general-purpose base. It has integrated tool-use capabilities and supports interleaved thinking, which Cohere says improves performance in multi-step agentic tasks.

Architecture mapping and code review. North Mini Code can analyze and map system architecture, surface dependencies, and perform code review in large codebases. With a 256,000 token context window, it can hold substantial multi-file projects in a single inference pass.

Terminal-based agentic tasks. The model is trained to handle terminal environments, shell interactions, package scripts, and command-line tooling. Cohere benchmarked it on Terminal-Bench v2, which tests agents in a real terminal environment rather than on synthetic code generation tasks.

How was it built?

North Mini Code is a sparse mixture-of-experts model with 128 experts, 8 of which are active per token. The computation required at inference time is closer to that of a 3 billion parameter model, despite the 30 billion total parameters. Cohere co-founder Nick Frosst demonstrated this by running it on a Mac Studio via MLX in about 20 gigabytes of RAM, the same machine he uses for his local coding work.
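To make "8 of 128 experts active per token" concrete, here is a minimal sketch of top-k expert routing in plain Python. The article does not describe Cohere's router, so the gating scheme below (pick the 8 highest router scores, then softmax over just those) is an assumption based on a common MoE design, and the `route` helper and random logits are purely illustrative.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, per the article
TOP_K = 8          # experts active per token, per the article


def route(router_logits: list[float], k: int = TOP_K) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This is a generic top-k gate, not Cohere's documented router.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)
print(f"{len(chosen)} of {NUM_EXPERTS} experts run for this token")
print(f"share of experts used: {TOP_K / NUM_EXPERTS:.4f}")
```

Only the selected experts' feed-forward weights are used for that token, which is why per-token compute tracks the active parameter count rather than the total. The share of experts used (8 of 128) is lower than the share of parameters active (3 billion of 30 billion); one plausible reason, not confirmed by the article, is that attention and embedding weights run for every token regardless of routing.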

Cohere trained the model through two stages of supervised fine-tuning, followed by reinforcement learning with verifiable rewards on more than 70,000 verifiable tasks spanning nearly 5,000 repositories, deduplicated against SWE-Bench.

Instead of optimizing against a single agent scaffold, Cohere trained across three. SWE-Agent uses a rich CLI with special commands. Mini-SWE-Agent uses a single bash tool with raw shell output. OpenCode uses individually typed tools that return structured JSON. Cohere reports a 10 percentage point gain on OpenCode evaluation with the multi-harness approach while maintaining SWE-Agent performance.

Where does it fit?

North Mini Code enters a market that now includes Mistral Devstral Small 2, GitHub Copilot, Cursor, and Claude Fable 5, each with different cost and deployment tradeoffs.

Cohere’s primary benchmark comparison is against Mistral Devstral Small 2, a 24 billion parameter dense model. In internal tests on the same hardware configuration, Cohere claims 2.8x higher output throughput and a 30% inter-token latency advantage over Devstral Small 2. Cohere also claims in its Hugging Face technical post that North Mini Code outperforms open-source models of up to four times its parameter count on its reported benchmarks, including models with 120 billion parameters.

Artificial Analysis independently ranks it eighth out of 127 comparable open-weight models on output speed, at 210 tokens per second, with a time to first token of 0.25 seconds against a class median of 1.95 seconds. It ranks 18th out of 127 on the Artificial Analysis Intelligence Index. A flag from the same data: the model generated 75 million output tokens to complete the Intelligence Index, against a class mean of 25 million. In high-volume agentic pipelines, that verbosity compounds into cost and latency.
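A rough sketch of how that verbosity shows up in response time, using the reported figures: latency is approximated as time to first token plus output tokens divided by output speed. The 1,000-token baseline response is a hypothetical number chosen for illustration, and the comparison assumes both responses decode at the same speed, which isolates the effect of token count alone.

```python
def response_latency_s(output_tokens: int, tokens_per_s: float, ttft_s: float) -> float:
    """Approximate wall-clock time: time to first token plus steady decode time."""
    return ttft_s + output_tokens / tokens_per_s


# Figures reported in the article.
OUTPUT_SPEED = 210.0                       # tokens per second
TTFT = 0.25                                # seconds to first token
VERBOSITY_RATIO = 75_000_000 / 25_000_000  # model tokens vs. class mean

# Hypothetical: a task a typical model answers in 1,000 output tokens.
BASELINE_TOKENS = 1_000

concise = response_latency_s(BASELINE_TOKENS, OUTPUT_SPEED, TTFT)
verbose = response_latency_s(int(BASELINE_TOKENS * VERBOSITY_RATIO), OUTPUT_SPEED, TTFT)

print(f"verbosity ratio: {VERBOSITY_RATIO:.1f}x")
print(f"baseline-length response: {concise:.2f} s")
print(f"3x-length response:       {verbose:.2f} s")
```

The low time to first token helps with responsiveness, but for long generations the decode term dominates, so a model that writes three times as much takes close to three times as long per step at the same speed.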

"Suddenly people are thinking, hey, am I getting enough economic value out of a token from a model?" Frosst said during the launch video. "Local deployment is a way to empower people and really make AI something that works for them."

GitHub Copilot, Cursor, and Claude Code work on per-use or subscription pricing with no on-premises option. Anthropic’s Claude Fable 5, the most capable managed coding model now publicly available, runs at $50 per million output tokens. For Frosst, the model is the exact opposite of Fable.

"It is small, cost effective, based on Apache 2.0 and locally deployable. LLMs should follow this path. Small, open source, transparent and sovereign, versus big, expensive, proprietary and hegemonic," Frosst wrote in a post on X.

What does this mean for enterprises?

For teams building production agentic coding pipelines, the release of North Mini Code clarifies a set of decisions that have been in the making for months.

Purpose-built agentic training is now the baseline for evaluation. With verified tool calls and multi-harness robustness, the difference between models fine-tuned for code and models trained specifically for agentic workflows is now a material factor in pipeline decisions. Any model vendor claiming agentic coding capability should be able to say whether its training used verifiable agentic tasks or was adapted from a general-purpose base.

Verbosity is a hidden pipeline cost that benchmarks do not reveal. Artificial Analysis measured North Mini Code generating three times as many output tokens as comparable models. That verbosity adds to cost and latency in high-volume pipelines. Throughput testing against actual workload volume is the evaluation step that benchmark rankings skip.

The split between managed pricing and local deployment is now a real architectural decision. Claude Fable 5 at $50 per million output tokens and North Mini Code on a single H100 represent a real tradeoff between managed infrastructure on the one hand and cost control and data residency on the other. Teams running high-volume agentic coding pipelines should model both cost paths against their actual workload before committing to either.
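One way to start modeling both cost paths is a back-of-the-envelope calculator like the sketch below. Only the $50 per million output tokens, the 210 tokens per second, and the 3x verbosity ratio come from the figures above; the workload size and the $3 per hour H100 rate are hypothetical placeholders. The sketch also assumes the managed model emits the baseline token volume, that the local GPU serves one stream at the measured speed, and it ignores input tokens, idle GPU time, and operations overhead.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    tasks_per_month: int
    baseline_output_tokens_per_task: int  # tokens a typical model would emit


def managed_cost_usd(w: Workload, price_per_million: float = 50.0) -> float:
    """Monthly output-token bill on a managed API priced per million tokens."""
    tokens = w.tasks_per_month * w.baseline_output_tokens_per_task
    return tokens / 1_000_000 * price_per_million


def local_gpu_hours(w: Workload, tokens_per_s: float, verbosity_ratio: float = 3.0) -> float:
    """GPU hours needed to generate the (more verbose) output locally."""
    tokens = w.tasks_per_month * w.baseline_output_tokens_per_task * verbosity_ratio
    return tokens / tokens_per_s / 3600


def local_cost_usd(w: Workload, gpu_usd_per_hour: float, tokens_per_s: float,
                   verbosity_ratio: float = 3.0) -> float:
    """Monthly GPU cost for the same tasks, scaled up by the verbosity ratio."""
    return local_gpu_hours(w, tokens_per_s, verbosity_ratio) * gpu_usd_per_hour


# Hypothetical workload and GPU price; replace with measured values.
workload = Workload(tasks_per_month=50_000, baseline_output_tokens_per_task=2_000)
H100_USD_PER_HOUR = 3.0

managed = managed_cost_usd(workload)
hours = local_gpu_hours(workload, tokens_per_s=210.0)
local = local_cost_usd(workload, H100_USD_PER_HOUR, tokens_per_s=210.0)

print(f"managed API: ${managed:,.2f} per month")
print(f"local H100:  ${local:,.2f} per month ({hours:,.1f} GPU hours)")
```

The GPU-hours figure matters as much as the dollar figure: if it exceeds the hours in a month, a single card cannot keep up at that speed, and the comparison has to account for batching or additional hardware.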


