Heretic is a tool that removes censorship (aka “safety alignment”) from Transformer-based language models without expensive training. It combines an advanced implementation of directional ablation, also known as “abliteration” (Arditi et al. 2024), with a TPE-based parameter optimizer powered by Optuna.
This approach enables Heretic to operate completely automatically: it finds high-quality abliteration parameters by co-minimizing the number of refusals and the KL divergence from the original model. The result is a decensored model that retains as much of the original model’s intelligence as possible. Using Heretic does not require an understanding of transformer internals. In fact, anyone who knows how to run a command-line program can use Heretic to decensor language models.
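To make the optimizer’s objective concrete, the following is a minimal sketch of how such a TPE study might be structured with Optuna. The helper functions, parameter ranges, and loss weighting are hypothetical stand-ins, not Heretic’s actual code:

```python
import random

import optuna

# Hypothetical stand-ins for the tool's internals; these names are
# illustrative only and are NOT Heretic's actual API.
def ablate(params):
    return params  # stand-in for applying ablation parameters to a model copy

def count_refusals(model):
    return random.randint(0, 100)  # stand-in for prompting and counting refusals

def kl_divergence(model):
    return random.random()  # stand-in for comparing against the original model

def objective(trial: optuna.Trial) -> float:
    # Sample candidate ablation parameters (the ranges are made up).
    params = {
        "direction_index": trial.suggest_float("direction_index", 0.0, 31.0),
        "max_weight": trial.suggest_float("max_weight", 0.0, 1.5),
    }
    model = ablate(params)
    # Co-minimize the refusal count and the KL divergence from the original
    # model; the relative weighting here is an arbitrary illustration.
    return count_refusals(model) + 10.0 * kl_divergence(model)

study = optuna.create_study(sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)
print(study.best_params)
```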

Running unsupervised with the default configuration, Heretic can produce decensored models that rival the quality of abliterations created manually by human experts:
The Heretic version, generated without any human effort, achieves the same level of refusal suppression as the other abliterations, but at a much lower KL divergence, indicating less damage to the original model’s capabilities.
(You can reproduce those numbers using Heretic’s built-in evaluation functionality, e.g. heretic --model google/gemma-3-12b-it --evaluate-model p-e-w/gemma-3-12b-it-heretic. Note that the exact values may be platform- and hardware-dependent. The above table was compiled using PyTorch 2.8 on an RTX 5090.)
Heretic supports most dense models, including many multimodal models and several different MoE architectures. It does not yet support SSM/hybrid models, models with inhomogeneous layers, or certain novel attention systems.
You can find a collection of models that have been decensored using Heretic on Hugging Face.
Set up a Python 3.10+ environment with PyTorch 2.2+ appropriate for your hardware. Then run:
pip install heretic-llm
heretic Qwen/Qwen3-4B-Instruct-2507
replacing Qwen/Qwen3-4B-Instruct-2507 with whatever model you want to decensor.
The process is completely automated and does not require configuration. However, Heretic has a variety of configuration parameters that can be adjusted for finer control. Run heretic --help to see the available command-line options, or look at config.default.toml if you prefer to use a configuration file.
At the start of a run, Heretic benchmarks the system to determine the optimal batch size for taking maximum advantage of the available hardware. With the default configuration, decensoring Llama-3.1-8B takes about 45 minutes on an RTX 3090.
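A generic pattern for such a batch-size benchmark looks like the sketch below; this is a common technique, not Heretic’s actual benchmarking code, and the helper name and workload callback are made up:

```python
import time

import torch

def find_batch_size(run_batch, max_exp: int = 12) -> int:
    """Try power-of-two batch sizes and keep the one with the best
    throughput, backing off when the GPU runs out of memory."""
    best_size, best_rate = 1, 0.0
    for size in (2 ** e for e in range(max_exp)):
        try:
            start = time.perf_counter()
            run_batch(size)  # caller-supplied inference workload
            rate = size / (time.perf_counter() - start)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            break  # larger batches will not fit either
        if rate > best_rate:
            best_size, best_rate = size, rate
    return best_size
```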
After Heretic has finished decensoring a model, you are given the option to save the model, upload it to Hugging Face, chat with it to test how well it works, or any combination of those actions.
Heretic implements a parametrized variant of directional ablation. For each supported Transformer component (currently, attention out-projection and MLP down-projection), it identifies the corresponding matrix in each Transformer layer, and orthogonalizes it with respect to the relevant “refusal direction”, inhibiting the expression of that direction in the result of multiplication with that matrix.
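In code, that orthogonalization amounts to projecting the refusal direction out of the matrix. Here is a minimal sketch of the operation, assuming a weight matrix that writes into the residual stream; it is not Heretic’s actual implementation:

```python
import torch

def orthogonalize(weight: torch.Tensor, direction: torch.Tensor,
                  scale: float = 1.0) -> torch.Tensor:
    """Remove a direction from a weight matrix that writes into the
    residual stream. `weight` has shape (d_model, d_in) and `direction`
    has shape (d_model,); scale=1.0 removes the direction completely,
    smaller values merely attenuate it."""
    direction = direction / direction.norm()
    # W' = W - scale * d (d^T W): for scale=1.0, the output of W' has
    # no component along d, whatever the input.
    return weight - scale * torch.outer(direction, direction @ weight)
```

A scale below 1.0 only attenuates the direction rather than removing it, which is where the per-layer ablation weights described below come into play.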
Refusal directions are computed, for each layer, as the difference between the means of the first-token residuals for “harmful” and “harmless” example prompts.
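A sketch of that difference-in-means computation for a single layer (the tensor shapes are assumptions, not Heretic’s actual code):

```python
import torch

def refusal_direction(harmful_resid: torch.Tensor,
                      harmless_resid: torch.Tensor) -> torch.Tensor:
    """Difference-in-means refusal direction for a single layer.
    Each argument stacks the first-token residuals of a set of
    example prompts, shape (n_prompts, d_model)."""
    direction = harmful_resid.mean(dim=0) - harmless_resid.mean(dim=0)
    return direction / direction.norm()
```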
The ablation process is controlled by several optimizable parameters:
- direction_index: either the index of a refusal direction, or the special value per layer, indicating that each layer should be ablated using the refusal direction associated with that layer.
- max_weight, max_weight_position, min_weight, and min_weight_distance: for each component, these parameters describe the shape and position of the ablation weight kernel over the layers, as shown in the following illustration:

[Illustration: shape and position of the ablation weight kernel over the layers]
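To illustrate how these four parameters might shape such a kernel, here is one plausible piecewise-linear interpretation; the exact kernel shape Heretic uses may differ, so treat this as a sketch rather than the implementation:

```python
def ablation_weight(layer: int, max_weight: float, max_weight_position: float,
                    min_weight: float, min_weight_distance: float) -> float:
    """One plausible reading of the kernel parameters: the weight peaks at
    max_weight_position and falls off linearly to min_weight at
    min_weight_distance layers away, staying flat beyond that point."""
    distance = abs(layer - max_weight_position)
    if distance >= min_weight_distance:
        return min_weight
    fraction = distance / min_weight_distance
    return max_weight + fraction * (min_weight - max_weight)
```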
Heretic’s main innovations over existing abliteration implementations are:
- The shape of the ablation weight kernel is highly flexible, which, combined with automatic parameter optimization, can improve the compliance/quality tradeoff. Non-constant ablation weights were first explored by Maxime Labonne in gemma-3-12b-it-abliterated-v2.
- The refusal direction index is a float rather than an integer. For non-integral values, the two nearest refusal direction vectors are linearly interpolated (see the sketch after this list). This unlocks a vast space of additional directions beyond those identified by the difference-in-means computation, and often enables the optimizer to find a better direction than the one associated with any individual layer.
- Ablation parameters are chosen separately for each component. I have found MLP interventions to be more damaging to the model than attention interventions, so using different ablation weights for each can yield some additional performance.
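The direction interpolation mentioned above can be pictured as follows; the shapes and normalization are assumptions, not Heretic’s actual code:

```python
import math

import torch

def interpolated_direction(directions: torch.Tensor, index: float) -> torch.Tensor:
    """Linearly interpolate between the two per-layer refusal directions
    nearest to a fractional index. `directions` has shape
    (n_layers, d_model)."""
    low = math.floor(index)
    high = min(low + 1, directions.shape[0] - 1)
    frac = index - low
    direction = (1.0 - frac) * directions[low] + frac * directions[high]
    return direction / direction.norm()
```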
I am aware of the following publicly available implementations of abliteration techniques:
Note that Heretic was written from scratch, and does not reuse code from any of those projects.
The development of Heretic was informed by:
Copyright © 2025 Philipp Emanuel Weidemann (pew@worldwidemann.com)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.
By contributing to this project, you agree to release your contribution under the same license.