Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot

For the past 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser.

Security teams tightened cloud access security broker (CASB) policies, blocking or monitoring traffic to known AI endpoints, and routing access through approved gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can inspect it, log it, and stop it. But that model is starting to break down.

A quiet hardware change is pushing large language model (LLM) use off the network and onto the endpoint. Call it Shadow AI 2.0, or the “bring your own model” (BYOM) era: employees are running capable models locally, on laptops, offline, with no API calls and no explicit network signatures. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate enterprise risk is increasingly “ungoverned inference inside the device.”

When inference occurs locally, interactions do not appear in traditional data loss prevention (DLP) telemetry. And when security can’t see it, it can’t manage it.

Why has local inference suddenly become practical?

Two years ago, running a useful LLM on a work laptop was a niche stunt. Today, it is common practice on technical teams.

Three things came together:

  • Consumer-Grade Accelerators Got Serious: A MacBook Pro with 64 GB of unified memory can often run quantized 70B-class models at usable speeds (with practical limits on context length). What once required multi-GPU servers is now possible on high-end laptops for many real-world workflows.

  • Quantization went mainstream: It is now easy to compress models into smaller, faster formats that fit in laptop memory with an acceptable quality tradeoff for many tasks.

  • Distribution is frictionless: Open-weight models are just a command away, and the tooling ecosystem makes “download → run → chat” trivial.

Result: An engineer can download multi-GB model artifacts, turn off Wi-Fi, and run sensitive workflows entirely locally: source code reviews, document summaries, drafting customer communications, even exploratory analysis on regulated datasets. No outbound packets, no proxy logs, no cloud audit trail.
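To make the “no outbound packets” point concrete, here is a minimal sketch of what such a fully local interaction looks like. The route and payload shape follow Ollama’s documented /api/generate endpoint on its default port; the model name is an illustrative assumption.

```python
import json

# Sketch of a purely local inference call. Everything targets the
# loopback interface, so network DLP, proxies and CASB never see it.
def build_local_request(model: str, prompt: str) -> tuple[str, bytes]:
    url = "http://127.0.0.1:11434/api/generate"  # loopback only, never the WAN
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return url, body.encode("utf-8")

url, body = build_local_request("llama3", "Review this auth function for bugs")
print(url)  # http://127.0.0.1:11434/api/generate
```

Sending that request with any HTTP client completes the round trip without a single byte leaving the machine, which is exactly why proxy and gateway logs stay empty.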

From a network-security perspective, that activity may seem indistinguishable from “nothing happened.”

The risk is no longer just that data leaves the company

If the data isn’t leaving the laptop, why should the CISO care?

Because the key risks shift from exfiltration to integrity, provenance and compliance. In practice, local inference creates three types of blind spots that most enterprises have not yet instrumented.

1. Code and decision contamination (integrity risk)

Local models are often adopted because they are fast, private and “require no approval.” The downside is that they are rarely vetted for enterprise environments.

A typical scenario: A senior developer downloads a community-tuned coding model because it benchmarks well. They paste in internal authentication logic, payment flows, or infrastructure scripts to clean them up. The model produces output that looks competent, compiles and passes unit tests, but subtly undermines the security posture (weak input validation, unsafe defaults, brittle concurrency patterns, dependency choices that are disallowed internally). The engineer merges the change.

If that interaction happened offline, you have no record that the AI influenced the code path at all. And when you later perform incident response, you will be investigating the symptom (a vulnerability) without visibility into the root cause (ungoverned model usage).

2. Licensing and IP Exposure (Compliance Risk)

Many high-performance models ship with licenses that include restrictions on commercial use, attribution requirements, limitations on scope of use, or obligations that may be inconsistent with proprietary product development. When employees run a model locally, that use can bypass the organization’s normal procurement and legal review process.

If a team uses a non-commercial model to generate production code, documentation, or product behavior, the company may inherit risks that later show up during M&A diligence, customer security reviews, or litigation. The tough part isn’t just the licensing terms; it’s the lack of inventory and traceability. Without a controlled model hub or usage records, you may not be able to prove what was used.

3. Model Supply Chain Risk (Emerging Risk)

Local inference also transforms the supply chain problem. Endpoints begin to accumulate large model artifacts and the toolchain around them: downloaders, converters, runtimes, plugins, UI shells, and Python packages.

There is an important technical nuance here: the file format matters. Newer formats such as safetensors are designed to prevent arbitrary code execution, but older pickle-based PyTorch files can execute a malicious payload simply by being loaded. If your developers are grabbing unvetted checkpoints from Hugging Face or other repositories, they aren’t just downloading data – they may be downloading an exploit.
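The pickle hazard is easy to demonstrate in a few lines. This sketch stands in for a tampered checkpoint; the payload here is a harmless eval, where a real attacker would call os.system or fetch a second stage.

```python
import pickle

class MaliciousCheckpoint:
    """Stands in for a tampered pickle-based .pt/.bin model file."""
    def __reduce__(self):
        # pickle calls this callable with these args at *load* time;
        # eval("6 * 7") is a benign stand-in for os.system(...).
        return (eval, ("6 * 7",))

blob = pickle.dumps(MaliciousCheckpoint())  # the "checkpoint" as it sits on disk
result = pickle.loads(blob)                 # merely loading it runs the payload
print(result)  # 42 — deserialization executed attacker-controlled code
```

By contrast, a safetensors file is a pure data container (tensor shapes, dtypes, bytes), so loading it cannot trigger code execution this way.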

Security teams have spent decades learning to treat unknown executables as hostile. BYOM requires extending that mindset to model artifacts and the surrounding runtime stack. The biggest organizational gap today is that most companies have no equivalent software bill of materials (SBOM) for models: provenance, hashing, allowed sources, scanning and lifecycle management.

Mitigating BYOM: Treat Model Weights Like Software Artifacts

You can’t solve local inference by blocking a URL. You need endpoint-aware controls and a developer experience that makes the safe path the easy path.

Here are three practical steps:

1. Take governance down to the endpoint

Network DLP and CASB still matter for cloud use, but they are not enough for BYOM. Start treating local model use as an endpoint governance problem by looking for specific signals:

  • Inventory and Discovery: Scan for high-fidelity indicators such as .gguf files, model files larger than 2 GB, llama.cpp or Ollama binaries, and listeners on Ollama’s default port, 11434.

  • Process and Runtime Awareness: Monitor for sustained high GPU/NPU (neural processing unit) usage from unapproved runtimes or unknown local inference servers.

  • Device Policy: Use mobile device management (MDM) and endpoint detection and response (EDR) policies to control the installation of unapproved runtimes and enforce baseline hardening on engineering machines. The point is not to punish experimentation; it is to regain visibility.
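As a sketch of what the discovery step can look like in practice, the snippet below scans a directory tree for model-artifact indicators and probes Ollama’s default port. The extension list and 2 GB threshold mirror the signals above; the helper names are hypothetical, not taken from any EDR product.

```python
import socket
from pathlib import Path

# Illustrative indicators, matching the signals described in the text.
MODEL_EXTENSIONS = {".gguf", ".safetensors", ".pt", ".bin"}
SIZE_THRESHOLD = 2 * 1024**3  # flag generic weight files larger than 2 GB

def find_model_artifacts(root: str) -> list[Path]:
    """Return files under `root` that look like local model weights."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in MODEL_EXTENSIONS:
            # .gguf is a high-fidelity signal at any size; other
            # extensions only count once they cross the size threshold.
            if path.suffix == ".gguf" or path.stat().st_size >= SIZE_THRESHOLD:
                hits.append(path)
    return hits

def ollama_listening(host: str = "127.0.0.1", port: int = 11434) -> bool:
    """True if something answers on Ollama's default local port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        return s.connect_ex((host, port)) == 0
```

A real deployment would run such checks from the EDR/MDM agent and feed hits into the model inventory rather than a one-off script.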

2. Provide a paved road: An internal, curated model hub

Shadow AI is often the result of friction. Approved tools are too restrictive, too generic, or too slow to get approved. A better approach is to offer a curated internal catalog that includes:

  • Accepted models for common tasks (coding, summarizing, classifying)

  • Verified license and usage guidance

  • Pinned versions with hashes (preferring secure formats like SafeTensors)

  • Clear documentation for secure local use, including where sensitive data is and is not allowed

If you want developers to stop scavenging, give them something better.
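Pinned versions with hashes can be enforced with a few lines at download time. This is a minimal sketch: the catalog entry and model name are hypothetical, and the pinned digest shown is simply the SHA-256 of empty input so the example is self-checking.

```python
import hashlib

# Hypothetical curated catalog: artifact name -> pinned SHA-256.
# (The digest below is SHA-256 of empty input, so the demo self-verifies.)
APPROVED_MODELS = {
    "codellama-13b-q4.safetensors":
        "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def verify_artifact(name: str, data: bytes) -> bool:
    """Accept a model artifact only if it is cataloged and its hash matches."""
    expected = APPROVED_MODELS.get(name)
    return expected is not None and hashlib.sha256(data).hexdigest() == expected

print(verify_artifact("codellama-13b-q4.safetensors", b""))          # True
print(verify_artifact("codellama-13b-q4.safetensors", b"tampered"))  # False
print(verify_artifact("mystery-model.bin", b""))                     # False
```

The same check gives you the usage record the licensing section called for: every load either matches a cataloged, legally reviewed artifact or gets flagged.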

3. Updated policy language: “Cloud services” is no longer enough

Most acceptable use policies talk about SaaS and cloud tools. BYOM requires a policy that clearly includes:

  • Downloading and running model artifacts on a corporate endpoint

  • Acceptable sources

  • License Compliance Requirements

  • Rules for using models with sensitive data

  • Retention and logging requirements for local inference tools

This does not need to be burdensome. It needs to be clear.

The perimeter is shifting back to the device

For a decade we moved security controls “up” into the cloud. Local inference is pulling a meaningful portion of AI activity “down” to the endpoint.

5 Signals Shadow AI has moved to the endpoint:

  • Large Model Artifacts: Unexplained storage consumption by .gguf or .pt files.

  • Local Inference Servers: Processes listening on ports like 11434 (Ollama).

  • GPU Usage Pattern: Increased GPU usage when offline or disconnected from a VPN.

  • Lack of Model Inventory: Inability to map code output to specific model versions.

  • License Ambiguity: Presence of “non-commercial” model weights in production builds.

Shadow AI 2.0 is not an imaginary future; it is a predictable result of faster hardware, easier distribution, and developer demand. CISOs who focus only on network controls won’t know what’s happening on the silicon sitting on employees’ desks.

The next phase of AI governance is less about blocking websites and more about controlling artifacts, provenance, and policy at the endpoint without sacrificing productivity.

Jayachander Reddy Kandakatla is a senior MLOps engineer.


