TL;DR

A paper posted to arXiv on April 2, 2026 — Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements — introduces PrecisionDiff, a framework that finds prompts where the same aligned LLM refuses under one numerical precision (BF16) and produces harmful output under another (FP16 or INT8). Disagreement success rates: 84.0% for BF16 vs FP16, 83.4% for BF16 vs INT16, 99.5% for INT16 vs INT8. None of the major safety benchmarks (HarmBench, JailbreakBench, AILuminate) log the precision they ran at — so their numbers quietly assume a property the models don't have.

What's new

For years, safety evals have treated a model as a fixed object: you hand it a prompt, you get a refusal or a jailbreak, you record a pass/fail. The researchers — Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng and Li Pan from SJTU, Beihang, NTU Singapore and SMU — kill that assumption.

Their insight: floating-point arithmetic on GPUs is not associative, and BF16, FP16, INT16 and INT8 each represent the same tensor slightly differently. Those tiny differences, normally invisible because greedy decoding still yields identical-looking text, accumulate across 32 transformer layers. At the final logits, the gap between "refuse" and "comply" can be as small as 0.01; precision noise alone is enough to flip it.
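
To make the mechanism concrete, here is a minimal, self-contained sketch (ours, not the paper's code) of how casting the same tensors to different dtypes shifts a tight logit margin; hidden, w_refuse and w_comply are synthetic stand-ins, not real model weights.

```python
# Toy illustration: the same dot product, rounded at BF16 vs FP16 element
# precision, lands at slightly different values, so a narrow refuse/comply
# margin can flip between precisions.
import torch

torch.manual_seed(0)
hidden = torch.randn(4096)            # stand-in for a final hidden state
w_refuse = torch.randn(4096) * 0.02   # stand-in LM-head row for a refusal token
w_comply = torch.randn(4096) * 0.02   # stand-in LM-head row for a compliance token

def margin(dtype):
    # Cast inputs to the target dtype before multiplying, roughly as a kernel
    # running natively at that precision would.
    h = hidden.to(dtype)
    refuse = (h * w_refuse.to(dtype)).sum()
    comply = (h * w_comply.to(dtype)).sum()
    return (refuse - comply).float().item()

print("FP32 margin:", margin(torch.float32))
print("BF16 margin:", margin(torch.bfloat16))
print("FP16 margin:", margin(torch.float16))
# The margins typically differ at the 1e-3 to 1e-2 level. Per the paper, the
# real refuse/comply gap can be ~0.01, so noise of this size can flip the
# greedy-decoded token.
```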

PrecisionDiff weaponizes this with a dual-precision, momentum-guided GCG optimizer: it searches for adversarial suffixes that push the model toward complying with the harmful request at one precision while keeping it refusing at the other. The result is prompts that are safe on a vLLM BF16 server and unsafe on a quantized consumer-GPU deployment, with the exact same weights.
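
The paper has not released its code yet, so the following is only a plausible sketch of such a dual-precision objective in PyTorch / Hugging Face style; model_bf16 and model_fp16 (the same checkpoint loaded at two dtypes), input_ids, target_ids and the weighting lam are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dual_precision_loss(model_bf16, model_fp16, input_ids, target_ids, lam=1.0):
    """Sketch of a dual-precision GCG objective: make the FP16 copy of the
    weights likely to produce the harmful target continuation while the BF16
    copy stays unlikely to produce it. Lower loss = bigger behavioural split."""
    def target_nll(model):
        # Teacher-forced negative log-likelihood of the target continuation.
        ids = torch.cat([input_ids, target_ids], dim=-1)
        logits = model(ids).logits.float()
        pred = logits[:, input_ids.shape[-1] - 1 : -1, :]  # positions predicting target_ids
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                               target_ids.reshape(-1))

    nll_fp16 = target_nll(model_fp16)   # drive down: comply at FP16
    nll_bf16 = target_nll(model_bf16)   # keep up: refuse at BF16
    return nll_fp16 - lam * nll_bf16

# A GCG-style loop would backpropagate this loss into one-hot embeddings of the
# adversarial suffix and greedily swap tokens; the momentum-guided variant the
# paper describes would additionally smooth those gradients across iterations.
```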

Why it matters

Three uncomfortable consequences fall out.

  • Every published refusal rate is precision-dependent. A model card that says "98% refusal on HarmBench" never tells you whether that run was BF16 on an H100, FP16 on an A100, or INT8 on an edge device. Under PrecisionDiff, those numbers can diverge by double digits.
  • Deployment teams now hold a safety lever they didn't know they had. Choosing FP16 over BF16 for latency or cost silently changes the model's alignment behaviour. That is a governance problem: the entity changing safety is not the entity that trained it.
  • Red-teamers get a new surface. An attacker can craft a prompt that only works when the target happens to be quantized — common on consumer GPUs, mobile NPUs, and bargain-tier inference APIs — leaving cloud BF16 audits squeaky clean.

Technical facts

The headline numbers from the paper:

Precision Pair    Disagreement Success Rate    Avg. Iterations
BF16 vs FP16      84.0%                        63.2
BF16 vs INT16     83.4%                        49.0
INT16 vs INT8     99.5%                        20.7

Against vanilla GCG as a baseline, PrecisionDiff improves attack-success rates by 1.4× to 8.5×:

  • Llama-2-7B: 68.0% (PrecisionDiff) vs 8.0% (GCG)
  • Vicuna-7B-v1.5: 72.0% vs 50.0%
  • Fuzzing / genetic-algorithm baselines: 2–8%

By harm category, cybercrime and malware prompts flip across precisions at over 90%; misinformation sits at 71.2%. The root-cause analysis — using Mean Absolute Difference and Relative Divergence Lift — localizes the amplification to three hotspots: input-stage token embeddings, attention projections (W_Q, W_K), and the output norm / LM head.
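
The paper names those two diagnostics, but as a summary we do not have their exact formulas; below is one plausible reading in PyTorch, where the inputs would be per-layer hidden states captured from the same forward pass run at two precisions. The relative_divergence_lift definition in particular is an assumption, not the paper's formula.

```python
import torch

def mean_absolute_difference(act_a: torch.Tensor, act_b: torch.Tensor) -> float:
    """Mean Absolute Difference between the same activation captured under two
    precisions, upcast to FP32 before comparing."""
    return (act_a.float() - act_b.float()).abs().mean().item()

def relative_divergence_lift(mad_by_layer: dict) -> dict:
    """Assumed reading of 'Relative Divergence Lift': how much each layer
    amplifies the divergence it receives from the previous layer. Values > 1
    would flag amplification hotspots (embeddings, W_Q/W_K, output norm / LM head)."""
    lifts, prev = {}, None
    for name, mad in mad_by_layer.items():   # expects an insertion-ordered dict
        if prev is not None and prev > 0:
            lifts[name] = mad / prev
        prev = mad
    return lifts
```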

Comparison

This work sits in a three-way conversation with existing literature:

  • vs. classical jailbreak research (GCG, PAIR, AutoDAN) — those assume one fixed deployment. PrecisionDiff treats dtype as part of the attack surface.
  • vs. quantization-jailbreak work (Quantization Contrastive Jailbreak, Q-resafe) — those focus on aggressive 4/8-bit quantization. PrecisionDiff shows the problem already exists between BF16 and FP16, the two dtypes every production shop actually uses (BF16 training, FP16 inference).
  • vs. reproducibility work — Give Me FP32 or Give Me Death? (June 2025) documented that BF16 inference is non-deterministic across GPU SKUs, batch sizes, and tensor-parallel degrees. PrecisionDiff is what happens when you turn that non-determinism into an adversarial objective.

Use cases

Who should care, concretely:

  • Safety evaluators and benchmark maintainers — any eval that doesn't log torch.dtype, GPU SKU, attention kernel (FlashAttention vs xFormers), and tensor-parallel degree is under-specified (see the sketch after this list). Expect HarmBench, JailbreakBench, and AILuminate to add a precision axis.
  • Inference providers (vLLM, SGLang, TGI, TensorRT-LLM) — dtype is no longer only a perf/cost knob; it is a safety knob. System cards and serving configs need to be explicit.
  • Frontier labs running RLHF or DPO in BF16 and serving in FP16 — "safety trained in" is not "safety served."
  • Regulators (EU AI Act, US EO 14110) — a model's safety posture is partly a property of the deployer, not just the trainer. Compliance tests need precision-stratified evidence.
  • Red-teamers — precision-specific prompts are a new, low-noise attack vector.
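
To make the first point concrete, here is a minimal, hypothetical metadata record an eval harness could log next to every refusal score; the function name and field names are illustrative, not an existing benchmark schema.

```python
import torch

def precision_record(model, tensor_parallel_degree=1, attention_backend="unknown"):
    """Illustrative inference-environment record for a safety eval run."""
    return {
        "torch_dtype": str(next(model.parameters()).dtype),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
        "torch_version": torch.__version__,
        "cuda_version": torch.version.cuda,        # None on CPU-only builds
        "attention_backend": attention_backend,    # e.g. "flash_attn_2", "xformers"
        "tensor_parallel_degree": tensor_parallel_degree,
    }
```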

Limitations & availability

Honest caveats:

  • All evaluated models are open-weight 7B–8B (Llama-2-7B, Llama-3-8B, Vicuna-7B-v1.5, Mistral-7B-Instruct-v0.2, Guanaco-7B). Frontier closed models weren't tested; their precisions aren't user-controllable anyway.
  • The attack is white-box — dual-precision GCG needs gradient access. Black-box transferability of precision-divergent prompts isn't demonstrated at scale yet.
  • Optimization cost: 49–63 iterations on average. Comparable to plain GCG; not free but not exotic.
  • The paper localizes where divergence happens; it does not yet propose a training-time fix. Q-resafe-style precision-aware safety patching is flagged as a candidate direction.

Availability: academic preprint, free to read. Evaluation data and anonymized artifacts hosted on Zenodo under DOI 10.5281/zenodo.19250143. Non-anonymized GitHub code release promised on formal acceptance.

What's next

A few things to watch through the rest of 2026:

  • Whether HarmBench, JailbreakBench and MLCommons AILuminate add precision as a reported axis in their next refresh.
  • Whether frontier labs publish precision-stratified refusal rates in system cards — today, none do.
  • Defensive work: adversarial training that explicitly covers precision pairs, or safety fine-tuning that enforces logit-level invariance across dtypes.
  • Extensions into MoE routing precision and long-context attention kernels, where floating-point non-determinism is known to be larger.

The uncomfortable takeaway is simple: alignment is a property of model weights plus the numerical pipeline that executes them. Treating it as a property of weights alone was a convenient fiction. That fiction just got a formal disproof.

Source: arXiv:2604.19790, arXiv:2506.09501, @sukh_saroy on X.