Venator
Jailbreak detection through LLM activation probing.
Venator is an automated pipeline for training and optimising detectors that identify jailbreak attempts on language models by reading the model's hidden activations during inference. Rather than analysing the text of a prompt or trusting the model's output, it looks at what the model is doing internally. Jailbreaks, it turns out, leave a very clear signature there.
Check out the code here -> https://github.com/peterstringer/venator
A linear probe (logistic regression) on middle-layer activations from Mistral-7B achieves 0.999 AUROC with around 30 labelled jailbreak examples. Even with just five labelled examples, it reaches 0.996.
During inference, Venator extracts hidden state activations from the transformer's middle layers — the point where the model has moved past token-level processing and started representing input intent. These 4096-dimensional vectors get reduced to 50 via PCA (you can go as low as 20 without meaningful loss), then scored by the probe. The entire scoring step is a single matrix multiply.
The pipeline runs locally on Apple Silicon via MLX with 4-bit quantised models. No cloud GPUs needed. One of the next questions is whether these smaller models can be used to flag jailbreaks for larger ones.
Model: Mistral-7B-Instruct-v0.3 (4-bit quantised via MLX) | Layer: 18 | PCA: 50 dims | Test set: 150 benign + 350 jailbreak prompts (held-out)
The supervised linear probe with just 5 labelled examples already reached 0.996 AUROC.
For context, the best unsupervised method (autoencoder) reached 0.869 AUROC.
Linear probe score distribution — benign prompts cluster near 0, jailbreaks near 1, with clean separation at the 0.550 threshold.
Precision-recall curve (linear probe).
ROC curve comparison — all 8 detectors overlaid.
The jailbreak boundary is linear. A Multi-Layer Perceptron (MLP) probe with considerably more capacity scored 0.997 — no improvement over logistic regression. The separation in activation space is a straight line.
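One way to sanity-check that claim on data of this shape: fit a linear probe and a higher-capacity MLP on synthetic, linearly separated activations and compare AUROC (everything here is simulated, not the project's results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)
w /= np.linalg.norm(w)

# Classes separated by a single hyperplane: extra capacity buys nothing.
X = np.vstack([rng.normal(size=(300, d)), rng.normal(size=(300, d)) + 3.0 * w])
y = np.array([0] * 300 + [1] * 300)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000,
                    random_state=0).fit(Xtr, ytr)

auc_linear = roc_auc_score(yte, linear.predict_proba(Xte)[:, 1])
auc_mlp = roc_auc_score(yte, mlp.predict_proba(Xte)[:, 1])
print(f"linear {auc_linear:.3f}  mlp {auc_mlp:.3f}")
```

When the boundary really is linear, the two scores come out essentially tied, which mirrors the 0.999 vs 0.997 result above.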
The signal appears early. Layer 4 out of 32 gives 0.981 AUROC. By layer 10 it's at 0.998+. The model recognises jailbreak intent almost immediately.
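The layer sweep is just the same probe refit at each depth. A minimal sketch, with a simulated per-layer signal schedule standing in for real hidden states (`layer_activations` is hypothetical, not the repo's extraction code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 64

def layer_activations(layer, n=400):
    # Simulated: the jailbreak signal strengthens with depth. In the
    # real pipeline these would be hidden states read at this layer.
    w = np.zeros(d)
    w[0] = 1.0
    strength = min(2.5, 0.5 * layer)
    X = np.vstack([rng.normal(size=(n, d)),
                   rng.normal(size=(n, d)) + strength * w])
    y = np.array([0] * n + [1] * n)
    return X, y

aucs = {}
for layer in (2, 4, 10, 18):
    X, y = layer_activations(layer)
    Xtr, ytr, Xte, yte = X[::2], y[::2], X[1::2], y[1::2]
    probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    aucs[layer] = roc_auc_score(yte, probe.predict_proba(Xte)[:, 1])
    print(f"layer {layer:2d}: AUROC {aucs[layer]:.3f}")
```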
It generalises across attack types. Training on one jailbreak category and testing on entirely different ones (e.g. training on DAN-style prompts, testing on encoding-based jailbreaks) gives 0.996–0.999 AUROC. There seems to be an identifiable "jailbreak direction" in activation space that is consistent regardless of technique.
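The cross-category test is the same fit with train and test drawn from different attack families. A simulated sketch, where every jailbreak shares one direction plus a category-specific stylistic offset (all synthetic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 50
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)

def category(n=300):
    # Each attack family adds its own offset on top of the
    # shared jailbreak direction.
    style = rng.normal(size=d)
    style /= np.linalg.norm(style)
    benign = rng.normal(size=(n, d))
    jail = rng.normal(size=(n, d)) + 3.0 * shared + 1.5 * style
    X = np.vstack([benign, jail])
    y = np.array([0] * n + [1] * n)
    return X, y

X_train, y_train = category()   # stand-in for one attack family
X_test, y_test = category()     # stand-in for a different family
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"cross-category AUROC: {auc:.3f}")
```

As long as the shared component dominates the per-family offsets, a probe trained on one family still ranks the other family's jailbreaks above benign prompts.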
Ensembling reduced performance. Combining all detectors dropped performance from 0.999 to 0.967. The weaker methods just added noise in the cases I tested.
Jailbreak signal has low dimensionality. The probe is remarkably insensitive to how aggressively you compress the activations. Raw 4096-dimensional activations score 0.9995 AUROC. PCA down to 50 dimensions: 0.9995. Down to 20: 0.9991. Even 10 dimensions still gives 0.996. The jailbreak signal is low-dimensional — most of those 4096 features are noise as far as detection is concerned.
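The compression sweep in code, on synthetic data (512 dims here for speed; the real activations are 4096-dimensional and the printed numbers are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
D = 512
w = rng.normal(size=D)
w /= np.linalg.norm(w)
X = np.vstack([rng.normal(size=(300, D)), rng.normal(size=(300, D)) + 3.5 * w])
y = np.array([0] * 300 + [1] * 300)
Xtr, ytr, Xte, yte = X[::2], y[::2], X[1::2], y[1::2]

aucs = {}
for k in (10, 20, 50, None):            # None = raw, no compression
    if k is None:
        Ztr, Zte = Xtr, Xte
    else:
        pca = PCA(n_components=k).fit(Xtr)
        Ztr, Zte = pca.transform(Xtr), pca.transform(Xte)
    probe = LogisticRegression(max_iter=1000).fit(Ztr, ytr)
    aucs[k] = roc_auc_score(yte, probe.predict_proba(Zte)[:, 1])
    print(f"dims {k or D:4d}: AUROC {aucs[k]:.3f}")
```

Because the class shift sits along one high-variance direction, even a 10-component PCA keeps it, which is the mechanism behind the flat AUROC curve above.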
Labelled training data is very efficient. Five labelled jailbreaks give 0.994 AUROC. Thirty gets you to 0.997. Seventy-five reaches 0.9995. For comparison, the best unsupervised methods without any labels top out at 0.812 (autoencoder) and 0.695 (PCA + Mahalanobis). The curve is nearly flat from 5 to 30 labels — the probe identifies the jailbreak direction from the first few examples and additional labels offer diminishing returns.
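The label-efficiency sweep is the same probe fit on progressively fewer labelled pairs. A synthetic sketch (the sample counts mirror the text; the AUROC values it prints come from simulated data, not the project's):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
d = 50
w = rng.normal(size=d)
w /= np.linalg.norm(w)

def sample(n_benign, n_jail):
    # Benign cluster at the origin, jailbreaks shifted along w.
    X = np.vstack([rng.normal(size=(n_benign, d)),
                   rng.normal(size=(n_jail, d)) + 4.0 * w])
    y = np.array([0] * n_benign + [1] * n_jail)
    return X, y

X_test, y_test = sample(300, 300)       # held-out evaluation set
aucs = {}
for n_labels in (5, 30, 75):
    X_train, y_train = sample(n_labels, n_labels)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    aucs[n_labels] = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
    print(f"{n_labels:2d} labelled jailbreaks: AUROC {aucs[n_labels]:.3f}")
```

A handful of labelled pairs already pins down the separating direction; more labels mostly refine the threshold, matching the near-flat curve from 5 to 30 labels.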
This project was influenced by Anthropic's research direction on applying anomaly detection to model latent activations to flag out-of-distribution inputs like jailbreaks. The approach builds on the Cheap Monitors paper (Cunningham et al., 2025), which showed that reusing a model's own intermediate representations for safety classification can match dedicated classifiers at a fraction of the cost.
It started as a purely unsupervised system, to test whether labelled data could be avoided entirely. But adding even a handful of labelled examples made the unsupervised approach redundant, so the focus shifted to understanding how few labels are actually needed and how well the probe generalises.
Weak-to-strong transfer. The jailbreak direction learned from Mistral-7B works across attack types. The next question is whether it transfers across models — train a probe on a small model, deploy it on a larger or architecturally different one. This is the unexplored problem setting Anthropic flagged in their research direction. The probe is a single weight vector, so only a small test set of activations from the target model is needed.
Adaptive attacks. An adversary aware of the probe could craft inputs that project low on the jailbreak direction while still functioning as jailbreaks. Testing the robustness of the linear boundary to this, and whether iterative adversarial training can shore it up, is an open problem.
MLX on M4 Pro, HDF5 for activation storage, Streamlit dashboard for the interactive pipeline. Full results and methodology are documented in the repo.