A weekend ‘vibe code’ hack by Andrej Karpathy quietly sketches the missing layer of enterprise AI orchestration

This weekend, Andrej Karpathy, former director of AI at Tesla and a founding member of OpenAI, decided he wanted to read a book. But he did not want to read it alone. He wanted to read it with a council of artificial intelligences, each offering its own perspective, critiquing the others, and ultimately synthesizing a final answer under the guidance of a "chairman."

To do this, Karpathy wrote what he called a "vibe code project": software written largely by AI assistants from natural-language prompts, built for fun rather than for production. He posted the result, a repository called "LLM Council," to GitHub with a blunt disclaimer: "I will not support it in any way... code is ephemeral now and libraries are over."

Yet for technical decision-makers in the enterprise, looking past the casual disclaimer reveals something far more important than a weekend toy. In a few hundred lines of Python and JavaScript, Karpathy has sketched a reference architecture for the most important, still-undefined layer of the modern software stack: the orchestration middleware that sits between corporate applications and the volatile market of AI models.

As companies finalize their platform investments for 2026, the LLM Council offers a rare unvarnished look at the "build vs. buy" reality of AI infrastructure: the logic behind routing and aggregating AI models is surprisingly simple, but the operational wrapper required to make it enterprise-ready is where the real complexity lies.

How the LLM Council works: four AI models debate, criticize and synthesize answers

To the casual observer, the LLM Council web application looks almost identical to ChatGPT: a user types a query into a chat box. But behind the scenes, the application triggers a three-step workflow that mirrors how human deliberative bodies work.

First, the system dispatches the user’s query to a panel of frontier models. In Karpathy’s default configuration, the panel includes OpenAI’s GPT-5.1, Google’s Gemini 3.0 Pro, Anthropic’s Claude Sonnet 4.5, and xAI’s Grok 4. These models generate their initial responses in parallel.

In the second stage, the software runs a peer review. Each model is shown its peers’ responses, anonymized, and asked to evaluate them for accuracy and insight. This step turns each AI from a generator into a critic, creating a layer of quality control that is rare in standard chatbot interactions.

Finally, a designated "Chairman LLM" (currently configured as Google’s Gemini 3.0 Pro) receives the original query, the individual responses, and the peer rankings. It synthesizes this material into a single, authoritative answer for the user.
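To make the flow concrete, here is a minimal sketch of the three stages, assuming OpenRouter as the broker; the model IDs, prompts, and function names are illustrative, not Karpathy’s actual code:

```python
import asyncio
import os

import httpx

URL = "https://openrouter.ai/api/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

# Model IDs are illustrative; the repo's actual defaults may differ.
COUNCIL = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]
CHAIRMAN = "google/gemini-3-pro-preview"

async def ask(client: httpx.AsyncClient, model: str, prompt: str) -> str:
    """Send one prompt to one model through the OpenRouter broker."""
    r = await client.post(URL, headers=HEADERS, timeout=120.0, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

async def council_answer(query: str) -> str:
    async with httpx.AsyncClient() as client:
        # Stage 1: every council member drafts an answer in parallel.
        drafts = await asyncio.gather(*(ask(client, m, query) for m in COUNCIL))

        # Stage 2: each member ranks the drafts, which are shown anonymously.
        anon = "\n\n".join(f"Response {i + 1}:\n{d}" for i, d in enumerate(drafts))
        review = (f"Question: {query}\n\nRank these anonymous responses "
                  f"by accuracy and insight, best first:\n\n{anon}")
        rankings = await asyncio.gather(*(ask(client, m, review) for m in COUNCIL))

        # Stage 3: the chairman folds drafts and rankings into one answer.
        brief = (f"Question: {query}\n\nDrafts:\n{anon}\n\nPeer rankings:\n"
                 + "\n\n".join(rankings)
                 + "\n\nSynthesize a single, authoritative answer.")
        return await ask(client, CHAIRMAN, brief)

print(asyncio.run(council_answer("What is the central argument of chapter one?")))
```

The notable design point is how little machinery this requires: two rounds of parallel fan-out and one final synthesis call.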

Karpathy said the results were often surprising. "Often, models are surprisingly willing to consider another LLM’s response as better than their own," he wrote on X (formerly Twitter). He described using the tool to read book chapters, noting that the models consistently praised GPT-5.1 as the most informative, while rating Claude the lowest. Karpathy’s own qualitative assessment, however, diverged from his digital council’s: he found GPT-5.1 "very verbose" and preferred Gemini’s "condensed and processed" output.

The case for treating FastAPI, OpenRouter, and Frontier models as swappable components

For CTOs and platform architects, the value of the LLM Council lies not in its literary criticism but in its construction. The repository serves as a primary-source document showing what a modern, minimal AI stack looks like at the end of 2025.

The application is deliberately built on a "thin" architecture. The backend uses FastAPI, a modern Python framework, while the frontend is a standard React application built with Vite. Data storage is handled not by a complex database but by simple JSON files written to local disk.
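The whole persistence story fits in a few lines. The sketch below shows the pattern, not the repo’s actual code; the endpoint path and record shape are assumptions:

```python
# A FastAPI backend persisting conversations as plain JSON files on disk.
import json
import uuid
from pathlib import Path

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
DATA_DIR = Path("conversations")
DATA_DIR.mkdir(exist_ok=True)

class Query(BaseModel):
    text: str

@app.post("/api/conversations")
def create_conversation(query: Query) -> dict:
    # In the real app this would trigger the three-stage council flow;
    # here we only show the storage model: one JSON file per conversation.
    conv_id = str(uuid.uuid4())
    record = {"id": conv_id, "messages": [{"role": "user", "content": query.text}]}
    (DATA_DIR / f"{conv_id}.json").write_text(json.dumps(record, indent=2))
    return record
```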

The linchpin of the entire operation is OpenRouter, an API aggregator that normalizes the interfaces of different model providers. By routing requests through this single broker, Karpathy avoided writing separate integration code for OpenAI, Google, Anthropic, and xAI. The application does not know or care which company provides the intelligence; it simply sends a prompt and waits for a response.

This design choice highlights a growing trend in enterprise architecture: the commoditization of the model layer. By treating frontier models as interchangeable components that can be swapped out by editing a single line in a configuration file (specifically, the COUNCIL_MODELS list in the backend code), the architecture insulates the application from vendor lock-in. If a new model from Meta or Mistral tops the leaderboards next week, it can be added to the council in seconds.
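In practice, the swap looks something like this (model IDs illustrative):

```python
# Rotating the council is a one-line edit to the backend's model list.
COUNCIL_MODELS = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",  # swap this line for a new entrant, e.g. a Mistral model
]
```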

What’s missing from prototype to production: authentication, PII redaction, and compliance

While the LLM Council’s core logic is elegant, it also starkly illustrates the gap between a weekend hack and a production system. For an enterprise platform team, cloning Karpathy’s repository is only the first step of a marathon.

A technical audit of the code reveals the absence of the "boring" infrastructure that commercial vendors sell at a premium. The system has no authentication; anyone with access to the web interface can query the models. There is no concept of user roles, meaning a junior developer has the same access rights as the CIO.

Furthermore, the governance layer is nonexistent. In a corporate environment, sending data to four different external AI providers simultaneously raises immediate compliance concerns. There is no mechanism to redact personally identifiable information (PII) before it leaves the local network, nor any audit log tracking who asked what.
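Even a first-pass governance layer is more code than the prototype contains. A minimal regex-based scrub of the kind that would run before any prompt leaves the network might look like this; production systems would use far more robust, NER-based redaction:

```python
import re

# Patterns are illustrative, not exhaustive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each PII match with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Email jane.doe@corp.com or call 555-867-5309."))
# -> "Email [EMAIL] or call [PHONE]."
```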

Reliability is another open question. The system assumes that the OpenRouter API is always up and that models will respond in a timely manner. It lacks the circuit breakers, fallback strategies, and retry logic that keep business-critical applications running through a provider outage.
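The shape of that missing wrapper is well known. A sketch of bounded retries with exponential backoff and graceful degradation to a fallback model follows; here `ask` stands for any synchronous "model, prompt to answer" client function, an assumption rather than an API from the repo:

```python
import time
from typing import Callable

def ask_with_fallback(
    ask: Callable[[str, str], str],
    prompt: str,
    primary: str = "openai/gpt-5.1",
    fallback: str = "google/gemini-3-pro-preview",
    retries: int = 3,
) -> str:
    for attempt in range(retries):
        try:
            return ask(primary, prompt)
        except Exception:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
    # The primary is treated as unavailable; fail over to the fallback model.
    return ask(fallback, prompt)
```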

These absences are not flaws in Karpathy’s code; he has been explicit that he does not intend to support or improve the project. But they define the value proposition for the commercial AI infrastructure market.

Companies like LangChain, AWS Bedrock, and a wave of AI gateway startups are essentially selling the "hardening" around the basic logic Karpathy demonstrated. They provide the security, observability, and compliance wrappers that turn a raw orchestration script into a viable enterprise platform.

Why Karpathy believes code is now "ephemeral" and traditional software libraries are obsolete

Perhaps the most provocative aspect of the project is the philosophy under which it was built. Karpathy described the development process as "99% vibe-coded," meaning he relied almost entirely on AI assistants to generate the code rather than writing it line by line.

"The code is now short-lived and the libraries are gone, ask your LLM to change it as you wish," He wrote in the repository’s document.

This statement signals a potential paradigm shift in software engineering. Traditionally, companies build internal libraries and abstractions to manage complexity, and maintain them for years. Karpathy is suggesting a future in which code is treated as quick scaffolding: disposable, easily regenerated by AI, and not built to last.

For enterprise decision-makers, this poses a difficult strategic question. If internal tools can be vibe-coded over a weekend, is there still a point in purchasing expensive, rigid software suites for internal workflows? Or should platform teams empower their engineers to build custom, disposable tools that meet their exact needs at a fraction of the cost?

When AI models assess AI: the dangerous gap between machine preferences and human needs

Beyond architecture, the LLM Council project inadvertently highlights a specific risk in automated AI deployment: the divergence between human and machine judgment.

Karpathy’s observation that his models preferred GPT-5.1 while he preferred Gemini suggests that AI models may share biases: they may favor verbosity, particular formatting, or rhetorical confidence in ways that do not align with human business requirements for brevity and accuracy.

As enterprises increasingly rely on "LLM-as-judge" systems to evaluate the quality of their customer-facing bots, this discrepancy matters. If an automated evaluator consistently rewards verbose, diffuse answers while human customers want short solutions, the metrics will report success even as customer satisfaction declines. Karpathy’s experiment suggests that relying solely on AI to grade AI is a strategy fraught with hidden alignment problems.
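One cheap defense is to periodically measure how often the judge and a human panel agree on the relative ordering of responses. The sketch below uses invented scores purely for illustration; low agreement is exactly the alignment gap described above:

```python
from itertools import combinations

def pairwise_agreement(judge: dict[str, int], human: dict[str, int]) -> float:
    """Fraction of response pairs ranked in the same order by judge and human."""
    pairs = list(combinations(judge, 2))
    agree = sum(
        (judge[a] - judge[b]) * (human[a] - human[b]) > 0 for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical scores: the judge rewards verbosity, the humans reward brevity.
judge_scores = {"gpt-5.1": 9, "gemini-3": 7, "claude-4.5": 5, "grok-4": 6}
human_scores = {"gpt-5.1": 6, "gemini-3": 9, "claude-4.5": 7, "grok-4": 5}
print(f"Judge/human agreement: {pairwise_agreement(judge_scores, human_scores):.0%}")
# -> 50%: the judge's ranking tells you little about what humans prefer.
```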

What enterprise platform teams can learn from a weekend hack before building their 2026 stack

Ultimately, the LLM Council serves as a Rorschach test for the AI industry. For hobbyists, it is a fun way to read books. For vendors, it is a threat: proof that the core functionality of their products can be replicated in a few hundred lines of code.

But for enterprise technology leaders, it is a reference architecture. It demystifies the orchestration layer, showing that the technical challenge is no longer in routing the prompts but in governing the data.

As platform teams move into 2026, many will likely find themselves studying Karpathy’s code, not to deploy it, but to understand it. It proves that a multi-model strategy is not technically out of reach. The open question is whether companies will build the governance layer themselves or pay someone else to wrap the "vibe code" in enterprise-grade armor.


