LUCID: Attention with Preconditioned Representations

TL;DR

The condition number of the key correlation matrix grows with sequence length, making softmax attention increasingly noisy at long contexts. Tuning the attention entropy can't fix this — lower entropy causes vanishing gradients, higher entropy causes representation collapse. LUCID preconditions the key correlations to decorrelate keys without changing the entropy: a drop-in replacement with up to 14% improvement on RULER multi-needle retrieval, 18% improvement on BABILong long-context reasoning, and ~1.3% inference overhead.

The Problem: Attention Noise at Long Contexts

Softmax attention computes a probability distribution over all keys for each query. As the sequence length $N$ grows, this distribution must cover more and more tokens. Even if most tokens are irrelevant, softmax still assigns them non-zero probability: it cannot output exact zeros.

This is the attention dilution problem: the probability mass that should concentrate on a handful of relevant tokens gets spread thin across thousands of irrelevant ones. For tasks like needle-in-a-haystack retrieval at 128K context, this dilution directly degrades performance.
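A small numerical illustration of the dilution effect. It assumes a single relevant key whose logit exceeds the logits of all other keys by a fixed margin; the margin value and the helper name `relevant_mass` are illustrative assumptions, not taken from the experiments.

```python
import math

def relevant_mass(n_tokens: int, margin: float = 5.0) -> float:
    """Softmax weight on one relevant token whose logit exceeds the
    logits of the other n_tokens - 1 (irrelevant) tokens by `margin`."""
    # Irrelevant logits are 0, the relevant logit is `margin`.
    top = math.exp(margin)
    return top / (top + (n_tokens - 1))

# The weight on the relevant token shrinks as the context grows,
# even though its logit margin stays the same.
for n in (1_000, 10_000, 128_000):
    print(n, round(relevant_mass(n), 4))
```

With a fixed margin, the only way to keep the mass on the relevant token constant is to let the margin grow like $\log N$, which is the same as lowering the entropy.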

The Entropy Paradox

Standard attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

The entropy of the attention distribution $a = \text{softmax}(q^\top [k_1, \ldots, k_N] / \sqrt{d})$ is determined by how the learned $Q$ and $K$ projections shape the logits. This entropy must land in a narrow operating zone:

Low entropy (sharp, peaked distributions) enables precise retrieval — but creates vanishing gradients. As the distribution approaches one-hot, the softmax Jacobian approaches zero everywhere. The model can no longer learn [8, 9].

High entropy (flat, uniform distributions) keeps gradients healthy — but causes representation collapse [9]. All queries retrieve nearly the same weighted average, destroying the model's ability to distinguish between tokens.
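The two regimes can be made concrete with a few lines of Python. This is a sketch with hand-picked logits and temperatures (both are assumptions chosen for illustration); it only shows how the entropy of a softmax distribution moves between roughly 0 and $\log N$.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.5, 0.0]
sharp = softmax(logits, temperature=0.05)    # near one-hot, entropy near 0
flat = softmax(logits, temperature=50.0)     # near uniform, entropy near log(4)

print(entropy(sharp), entropy(flat), math.log(len(logits)))
```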

As context grows, attention tends to drift into the high-entropy zone — a phenomenon observed in prior work like Differential Transformer [6]. LUCID's key insight is that we shouldn't be fighting entropy at all; the root problem lies elsewhere.

Softmax attention dilutes as context grows. Lowering entropy sharpens focus but kills gradients; raising it preserves gradients but collapses representations. Neither direction scales.

Root Cause: Correlated Keys

To understand why attention dilutes, consider what softmax actually computes. The exponential kernel $\exp(\langle \cdot, \cdot \rangle)$ induces a Reproducing Kernel Hilbert Space (RKHS) with feature map $\phi$, where inner products in this space equal exponentiated inner products in the original space. The attention weights are determined by the key correlation matrix in this RKHS:

$$G_{ij} = \exp\!\left(\frac{k_i^\top k_j}{\sqrt{d}}\right) = \langle \phi(k_i), \phi(k_j) \rangle$$

A critical property: since $G_{ij}$ is an exponentiated inner product, $G_{ij} > 0$ for all $i, j$. Every pair of keys is always positively correlated in the RKHS. There are no negative or zero entries — the feature-space keys $\phi(k)$ cannot be orthogonal.

The condition number $\kappa(G)$ of this matrix measures how correlated the keys are. When $\kappa$ is large, many keys point in similar directions in the RKHS, making it hard for softmax to distinguish between them. As sequence length increases, the condition number grows because more keys packed into a finite-dimensional space inevitably become more correlated. This is the fundamental cause of attention dilution.
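A toy check of both properties, using random unit-norm keys as a stand-in for learned keys (an assumption; the measured curves come from trained models). Because each shorter sequence is a prefix of the longer one, its Gram matrix is a principal submatrix of the larger one, so by eigenvalue interlacing $\kappa$ cannot decrease as keys are appended.

```python
import numpy as np

def exp_gram(keys):
    """G_ij = exp(k_i . k_j / sqrt(d)) for keys of shape (N, d)."""
    d = keys.shape[1]
    return np.exp(keys @ keys.T / np.sqrt(d))

def condition_numbers(keys, lengths):
    """kappa(G) for the first n keys, for each n in `lengths`."""
    return [float(np.linalg.cond(exp_gram(keys[:n]))) for n in lengths]

rng = np.random.default_rng(0)
keys = rng.standard_normal((128, 16))                  # N = 128, d = 16 (toy sizes)
keys /= np.linalg.norm(keys, axis=1, keepdims=True)    # unit-norm keys

lengths = (8, 32, 128)
kappas = condition_numbers(keys, lengths)
for n, kappa in zip(lengths, kappas):
    print(n, f"{kappa:.3g}")
```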

Figure: Condition number $\kappa$ of the LUCID preconditioner matrix grows with sequence length, showing power-law growth from 2K to 65K tokens. Higher $\kappa$ indicates stronger key correlations, where LUCID's correction becomes more essential. Measured on ~1B parameter models (hidden size 2048, 24 layers, 32 heads) during continual pretraining on Dolma [17], finetuned from 2K to 65K.

The condition number grows with sequence length due to accumulating key correlations. This supports the expectation that LUCID's advantage increases for longer sequences, where key correlations are worst.

The root cause is geometry, not entropy. In the exponential kernel RKHS, keys are always positively correlated ($G_{ij} > 0$). As sequence length grows, $\kappa(G)$ increases and attention noise worsens — this is what LUCID directly fixes.

The LUCID Method

Softmax Attention as Gradient Descent in RKHS

Following Katharopoulos et al. [2] ("Transformers are RNNs"), the exponential inner product in softmax attention can be expressed using a kernel function:

$$\exp\!\left(\langle q_i, k_j \rangle\right) = \langle \phi(q_i), \phi(k_j) \rangle$$

where $\phi: \mathbb{R}^d \to \mathcal{H}$ is a feature map to a Reproducing Kernel Hilbert Space (RKHS). Using this view, we can interpret unnormalized softmax attention as gradient descent on a linear objective. We maintain a state matrix $S: \mathcal{H} \to \mathbb{R}^d$ that stores key-value associations, and at each step $t$ take a gradient step on:

$$f_t(S) = -v_t^\top S\,\phi(k_t)$$

The gradient is $\nabla_S f_t(S) = -v_t\,\phi(k_t)^\top$, so a gradient descent step with unit step size yields the additive update:

$$S_t = S_{t-1} + v_t\,\phi(k_t)^\top$$

This is precisely the update underlying standard linear attention. But the linear objective has fundamental limitations: it is unbounded from below, the updates are state-independent (occurring regardless of whether the association is already stored), and each update accumulates interference without removing old information.

Quadratic Objective → Delta Rule in RKHS

Following Yang et al. [3] (DeltaNet, arXiv:2406.06484), a more principled approach uses a quadratic objective that directly measures retrieval error:

$$f_t(S) = \frac{1}{2}\|S\,\phi(k_t) - v_t\|^2$$

This objective is bounded below by 0, with a clear minimum at $S\,\phi(k_t) = v_t$. The gradient descent update with step size $\beta_t = 1$ yields the delta rule in RKHS:

$$S_t = S_{t-1}(I - \phi(k_t)\,\phi(k_t)^\top) + v_t\,\phi(k_t)^\top$$

This is an erase-then-write mechanism:

$$S_t = S_{t-1} - \underbrace{(S_{t-1}\phi(k_t))\,\phi(k_t)^\top}_{\text{erase old association}} + \underbrace{v_t\,\phi(k_t)^\top}_{\text{write new association}}$$

A key advantage is self-regulation: when the current prediction is already correct ($S_{t-1}\phi(k_t) = v_t$), the gradient is zero and no update occurs. This property is absent in the linear objective, which blindly updates regardless.
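The erase-then-write update and its self-regulation can be checked numerically. The sketch below uses the identity feature map with a unit-norm key as a finite-dimensional stand-in for $\phi(k_t)$ (an assumption, since the RKHS feature map is infinite-dimensional); with a unit-norm feature vector and step size 1, a single update stores the value exactly.

```python
import numpy as np

def delta_update(S, phi_k, v):
    """One delta-rule step with step size 1:
    S <- S - (S phi_k) phi_k^T + v phi_k^T."""
    erase = np.outer(S @ phi_k, phi_k)   # remove what S currently returns for this key
    write = np.outer(v, phi_k)           # store the new value under this key
    return S - erase + write

rng = np.random.default_rng(0)
d_k, d_v = 8, 4                          # toy sizes
phi_k = rng.standard_normal(d_k)
phi_k /= np.linalg.norm(phi_k)           # unit norm so that one step stores v exactly
v = rng.standard_normal(d_v)

S0 = rng.standard_normal((d_v, d_k))     # arbitrary prior state
S1 = delta_update(S0, phi_k, v)          # retrieval S1 @ phi_k now equals v
S2 = delta_update(S1, phi_k, v)          # same association again: nothing changes
```

By contrast, the additive update $S_t = S_{t-1} + v_t\,\phi(k_t)^\top$ would add `np.outer(v, phi_k)` a second time, doubling the stored association.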

Why RKHS, Not Key Space? (LUCID vs DeltaNet)

The delta rule update above is precisely DeltaNet [3, 4], but generalized from finite-dimensional key space to the infinite-dimensional RKHS induced by the exponential kernel. The key differences:

DeltaNet: Uses the identity map $\phi(x) = x$, operating in the $d$-dimensional token space. The preconditioner $(I + \text{stril}(KK^\top))^{-1}$ decorrelates keys in this finite-dimensional space. When keys are orthogonal ($k_i^\top k_j = 0$), DeltaNet's correction vanishes.
LUCID: Uses the exponential kernel feature map $\phi: \mathbb{R}^d \to \mathcal{H}$, operating in an infinite-dimensional RKHS. The preconditioner $(M \circ \exp(KK^\top))^{-1}$ decorrelates keys in this richer space. Since $\exp(k_i^\top k_j) > 0$ always, LUCID's correction never vanishes.
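The contrast is easiest to see with exactly orthogonal keys. The sketch below builds both preconditioners for four orthogonal keys, following the unnormalized formulas in the two items above; the variable names are illustrative.

```python
import numpy as np

K = np.eye(4)  # four mutually orthogonal keys, d = 4

# DeltaNet-style preconditioner: I + strictly-lower-triangular part of K K^T.
deltanet_P = np.eye(4) + np.tril(K @ K.T, k=-1)

# LUCID-style preconditioner: causal mask (lower triangle including the diagonal)
# applied elementwise to exp(K K^T); normalization omitted, as in the text above.
lucid_P = np.tril(np.exp(K @ K.T))

print(deltanet_P)   # identity: no correction for orthogonal keys
print(lucid_P)      # off-diagonal entries are exp(0) = 1: correction still active
```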

This distinction matters at long contexts. Keys live in a finite head dimension (e.g., $d = 128$), so at sequence lengths like 128K, the $d$-dimensional key space is far too small to accommodate 128K nearly-orthogonal vectors — keys inevitably interfere, and DeltaNet's correction in this crowded space is limited. The RKHS induced by the exponential kernel, however, is infinite-dimensional. The feature-space representations $\phi(k)$ have the potential to be far less correlated in this richer space, and LUCID's preconditioner exploits this by decorrelating them, enabling precise retrieval even when the original keys are highly entangled.

Why infinite dimensions matter: You can't fit 128K nearly-orthogonal vectors in a 128-dimensional key space — interference is inevitable. But in the infinite-dimensional RKHS, the feature-space keys $\phi(k)$ have room to be decorrelated by LUCID's preconditioner.

The Key Equation

The recurrent delta rule can be computed in parallel. Collecting outputs for all tokens and introducing standard $1/\sqrt{d}$ logit scaling plus RMS normalization for the keys inside the preconditioner ($k_{i,\text{RN}} \leftarrow \sqrt{d} \cdot k_i / \|k_i\|_2$), we arrive at the LUCID attention formula:

$$\text{LUCID}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}} + \hat{M}\right) \cdot \left(M \circ \exp\!\left(\frac{K_{\text{RN}}K_{\text{RN}}^\top}{\sqrt{d}} - \sqrt{d}\right)\right)^{\!-1} V$$

where $K_{\text{RN}}$ denotes the RMS-normalized keys, $M$ is the binary causal mask ($M_{ij} = 1$ if $i \ge j$, $0$ otherwise), and $\hat{M}$ is its additive form ($\hat{M}_{ij} = 0$ if $i \ge j$, $-\infty$ otherwise). The middle factor is the preconditioner: it inverts the key correlation structure, decorrelating the keys before the attention output is computed. With RMS normalization, each diagonal entry is $\exp(\sqrt{d} - \sqrt{d}) = 1$ and each off-diagonal entry lies in $(0, 1]$, yielding better condition numbers.
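A minimal single-head NumPy reference for the formula above, intended as a readable sketch rather than the authors' implementation. The function names, the dense $N \times N$ matrices, and the toy sizes are assumptions; a practical version would be batched and use a triangular solver.

```python
import numpy as np

def lucid_preconditioner(K):
    """M o exp(K_RN K_RN^T / sqrt(d) - sqrt(d)) for keys K of shape (N, d)."""
    sqrt_d = np.sqrt(K.shape[1])
    K_rn = sqrt_d * K / np.linalg.norm(K, axis=1, keepdims=True)  # RMS-normalized keys
    return np.tril(np.exp(K_rn @ K_rn.T / sqrt_d - sqrt_d))       # causal mask M

def lucid_attention(Q, K, V):
    """Causal LUCID attention for Q, K of shape (N, d) and V of shape (N, d_v)."""
    N, d = K.shape
    causal = np.tril(np.ones((N, N), dtype=bool))
    # Additive mask: -inf above the diagonal, then a row-wise stable softmax.
    logits = np.where(causal, Q @ K.T / np.sqrt(d), -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)
    # Apply P^{-1} to V through a linear solve instead of forming the inverse.
    return A @ np.linalg.solve(lucid_preconditioner(K), V)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 4)) for _ in range(3))
P = lucid_preconditioner(K)
out = lucid_attention(Q, K, V)
```

The first token can only attend to itself and the preconditioner has a unit diagonal, so `out[0]` equals `V[0]`, which is a convenient sanity check.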

Practical Implementation

The intention is to precondition the keys and thereby the attention probabilities formed by query-key correlations. However, applying the inverse preconditioner to the values $V$ instead is mathematically equivalent, since matrix multiplication is associative: $(\text{softmax}(\cdots) \cdot P^{-1}) V = \text{softmax}(\cdots) \cdot (P^{-1} V)$. This reformulation has two practical advantages:

One preconditioner per KV head: The preconditioner $P$ depends only on keys, so it is shared across all query heads in the same KV group. Instead of preconditioning every query head's attention weights, we precondition the values once per KV head.
Linear solve instead of matrix inverse: Rather than explicitly computing $P^{-1}$ and multiplying, we solve the linear system $PY = V$. Since $P$ is lower triangular (thanks to the causal mask $M$), this is just forward substitution — $O(N^2 d)$ complexity, the same as standard attention.
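Both points can be sketched in a few lines. The loop below is a plain forward substitution for $PY = V$ (a didactic stand-in for an optimized triangular solver), followed by a check that preconditioning the values gives the same result as preconditioning the attention weights. The random matrices are placeholders, not real attention weights.

```python
import numpy as np

def forward_substitution(P, V):
    """Solve P Y = V for lower-triangular P (N, N) and V (N, d_v)."""
    N = P.shape[0]
    Y = np.zeros_like(V, dtype=float)
    for i in range(N):
        # Row i only depends on the already-solved rows 0..i-1.
        Y[i] = (V[i] - P[i, :i] @ Y[:i]) / P[i, i]
    return Y

rng = np.random.default_rng(0)
N, d_v = 6, 3
P = np.tril(rng.uniform(0.1, 1.0, size=(N, N)))
np.fill_diagonal(P, 1.0)                  # unit diagonal, as in the preconditioner
V = rng.standard_normal((N, d_v))
A = rng.uniform(size=(N, N))              # stand-in for the attention weights

Y = forward_substitution(P, V)            # Y = P^{-1} V without forming P^{-1}
```

Row $i$ costs $O(i \cdot d)$ operations, which sums to $O(N^2 d)$ over the sequence. Since `Y` depends only on keys and values, it can be reused by every query head in the same KV group.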

In practice, we use torch.linalg.solve_triangular, which dispatches to the cuBLAS TRSM (Triangular Solve) kernel — a highly optimized divide-and-conquer routine on GPU. Modern LLMs use Grouped-Query Attention (GQA) with far fewer KV heads than query heads (e.g., 4 KV heads for 32 query heads in our 1B model), so the triangular solve is only performed 4 times, not 32, keeping overhead minimal:

Training overhead: ~0–5.5% depending on architecture (0% for Gemma 3-1B, 3.5% for Gemma 3-4B, 5.5% for Qwen 2.5-1.5B)
Inference overhead: ~1.3% (77ms vs 76ms at 32K context with 100 new tokens)

A compute ablation confirms these gains are architectural, not from extra compute: training the baseline 10% longer does not match LUCID's multi-needle retrieval performance.

LUCID = DeltaNet in RKHS. It arises from optimizing a quadratic retrieval-error objective instead of the linear objective underlying standard attention. The resulting delta rule operates in the infinite-dimensional RKHS rather than in the crowded 128-dimensional key space, and the preconditioner is applied efficiently via forward substitution (cuBLAS TRSM) with 0–5.5% training overhead.

Learnability

The Softmax Gradient Problem

A critical requirement for attention mechanisms is the ability to represent sharp distributions — when a query $q_i$ matches a specific key $k_j$, attention should concentrate on the corresponding value $v_j$. In standard softmax, the way to achieve this is to lower the temperature, pushing the output toward a one-hot vector:

$$\text{softmax}(z / \tau) \xrightarrow{\tau \to 0} e_{\arg\max z}$$

But this kills learning. Writing $a = \text{softmax}(z)$ for logits $z$, the softmax Jacobian is:

$$J = \frac{\partial a}{\partial z} = \text{diag}(a) - aa^\top$$

When $a = e_i$ (one-hot), we get $J = \text{diag}(e_i) - e_i e_i^\top = 0$. Zero Jacobian means zero gradients — learning stops completely. This creates a fundamental dilemma: standard attention can either retrieve precisely (low temperature, no gradients) or learn effectively (higher temperature, blurred retrieval).
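A quick numerical check of this collapse. The logits and temperatures are arbitrary illustrative values; the point is only that the largest Jacobian entry shrinks toward zero as the temperature drops.

```python
import numpy as np

def softmax(z, tau=1.0):
    s = z / tau
    e = np.exp(s - s.max())              # subtract max for numerical stability
    return e / e.sum()

def softmax_jacobian(a):
    """diag(a) - a a^T for an attention distribution a."""
    return np.diag(a) - np.outer(a, a)

z = np.array([3.0, 1.0, 0.0, -1.0])
for tau in (1.0, 0.1, 0.01):
    J = softmax_jacobian(softmax(z, tau))
    print(tau, np.abs(J).max())          # largest entry shrinks as tau decreases
```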

LUCID Decouples Retrieval from Entropy

LUCID resolves this tension by achieving sharp retrieval through the preconditioner $(M \circ \exp(KK^\top))^{-1}$ rather than by lowering the softmax temperature. The softmax operates at standard temperature with a well-conditioned gradient, while the preconditioner sharpens the final output through deconvolution. This fundamental decoupling enables LUCID to achieve both precise retrieval and effective learning simultaneously.

Theorem (Gradient Preservation). Let $o$ be the LUCID attention output (before multiplying by $V$). Assume $K \neq 0$ and at least one column of $\text{diag}(a) - aa^\top$ is not in the null-space of $K^\top$, where $a = \text{softmax}(qK^\top / \sqrt{d})$. Then $\partial o / \partial q \neq 0$.

The proof follows from the Jacobian decomposition:

$$\frac{\partial o}{\partial q} = \frac{K^\top}{\sqrt{d}} \left(\text{diag}(a) - aa^\top\right) \left(M \circ \exp\!\left(\frac{K_{\text{RN}}K_{\text{RN}}^\top}{\sqrt{d}} - \sqrt{d}\right)\right)^{\!-1}$$

Since the preconditioner is a lower-triangular matrix with positive diagonal entries, it is invertible with trivial null-space. The gradient can only vanish if the softmax becomes one-hot — but LUCID doesn't need low temperature, so this doesn't happen.
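The Jacobian decomposition can be verified against finite differences. In this sketch, $q$ is treated as the query of the last position so that all $N$ keys are visible; the sizes, the random inputs, and the helper names are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def preconditioner(K):
    """M o exp(K_RN K_RN^T / sqrt(d) - sqrt(d)) with RMS-normalized keys."""
    sqrt_d = np.sqrt(K.shape[1])
    K_rn = sqrt_d * K / np.linalg.norm(K, axis=1, keepdims=True)
    return np.tril(np.exp(K_rn @ K_rn.T / sqrt_d - sqrt_d))

def lucid_weights(q, K, P_inv):
    """o = softmax(q K^T / sqrt(d)) P^{-1}, the output before multiplying by V."""
    return softmax(K @ q / np.sqrt(K.shape[1])) @ P_inv

def analytic_jacobian(q, K, P_inv):
    """(K^T / sqrt(d)) (diag(a) - a a^T) P^{-1}, shape (d, N)."""
    a = softmax(K @ q / np.sqrt(K.shape[1]))
    return (K.T / np.sqrt(K.shape[1])) @ (np.diag(a) - np.outer(a, a)) @ P_inv

def numeric_jacobian(q, K, P_inv, h=1e-5):
    """Central finite differences; row i holds d o / d q_i."""
    rows = []
    for i in range(q.size):
        step = np.zeros_like(q)
        step[i] = h
        rows.append((lucid_weights(q + step, K, P_inv)
                     - lucid_weights(q - step, K, P_inv)) / (2 * h))
    return np.stack(rows)

rng = np.random.default_rng(0)
N, d = 5, 4
K = rng.standard_normal((N, d))
q = rng.standard_normal(d)
P_inv = np.linalg.inv(preconditioner(K))

J_analytic = analytic_jacobian(q, K, P_inv)
J_numeric = numeric_jacobian(q, K, P_inv)
```

Because $P^{-1}$ is invertible, multiplying by it cannot send a nonzero matrix to zero, so the product vanishes only if $K^\top (\text{diag}(a) - aa^\top)$ does.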

Synthetic Experiment: Sequential Task Learning

To empirically validate this, we designed a two-phase experiment on a single-layer transformer (dim 256, 1 head, sequences of length 10):

Phase 1 (Self-Retrieval): Learn to copy the input ($y_i = x_i$). Requires sharp, identity-like attention.
Phase 2 (Cumulative Averaging): Without resetting weights, switch to computing $y_i = \frac{1}{i}\sum_{j=1}^{i} x_j$. Requires adapting from sparse to dense distributions.
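The two targets, and the idealized attention patterns that would produce them, can be written down directly. This is a sketch of the data side only, assuming random Gaussian inputs and identity value and output projections; the input distribution and training setup are not specified here.

```python
import numpy as np

def phase1_targets(x):
    """Self-retrieval: y_i = x_i."""
    return x.copy()

def phase2_targets(x):
    """Cumulative averaging: y_i = (1/i) * sum_{j<=i} x_j, with 1-indexed i."""
    counts = np.arange(1, x.shape[0] + 1)[:, None]
    return np.cumsum(x, axis=0) / counts

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 256))       # sequence length 10, dim 256 as in the text

n = x.shape[0]
identity_attention = np.eye(n)                                     # sparse: one-hot rows
averaging_attention = np.tril(np.ones((n, n))) / np.arange(1, n + 1)[:, None]  # dense: uniform causal rows
```

Phase 1 needs one-hot rows and Phase 2 needs uniform rows over the causal prefix, which is why a model that reached Phase 1 by collapsing its softmax Jacobian has trouble moving to Phase 2.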

Figure: Sequential task learning experiment. **Left:** Training loss across two phases. Both methods solve Phase 1 (self-retrieval), but only LUCID adapts to Phase 2 (cumulative averaging). **Right:** Off-diagonal Jacobian magnitude (log scale). Standard softmax reduces its Jacobian by ~$10^3\times$ during Phase 1 to achieve sharpness, blocking gradient flow in Phase 2. LUCID maintains higher Jacobian values throughout, enabling rapid adaptation.

During Phase 1, both methods achieve near-zero loss. But standard softmax achieves sharpness by progressively reducing its Jacobian $(\text{diag}(a) - aa^\top)$ by approximately three orders of magnitude — effectively lowering its implicit temperature. When Phase 2 arrives, this near-zero Jacobian blocks gradient flow and the model cannot adapt. LUCID, having achieved sharpness through its preconditioner while maintaining higher Jacobian magnitudes, rapidly adapts to the new task.

LUCID decouples retrieval sharpness from softmax entropy. The preconditioner provides precision while the softmax operates at standard entropy, preserving gradient flow. Standard attention must choose between sharp retrieval and effective learning — LUCID gets both.

Results

LUCID is evaluated on ~1B parameter language models (22 layers, model dim 2048, 32 query heads, 4 KV heads) with context windows up to 128K tokens. It is a drop-in replacement for standard attention — no architectural changes, no additional parameters.

Multi-Needle Retrieval (RULER)

On the RULER [10] multi-needle retrieval benchmark:

Figure: MNIAH heatmaps. **Left:** Standard Attention accuracy degrades sharply as task difficulty increases (more needles, longer sequences), dropping to 11.4% in the hardest configuration. **Middle:** LUCID Attention maintains substantially higher accuracy across all settings. **Right:** Difference map showing consistent improvements of 10–26%, with LUCID providing the largest gains at longer sequence lengths.

Figure: Multi-needle retrieval accuracy vs. finetuning sequence length. The relative improvement of LUCID over Standard Attention increases from +19.8% (32K finetuning) to **+47.3%** (64K finetuning). LUCID's gains scale with context length, exactly where standard attention struggles most.

BABILong

BABILong [13] tests multi-hop fact retrieval and reasoning across long contexts (32K–128K tokens). We introduce LUCID-PaTH, which combines LUCID's key decorrelation with PaTH positional encoding [7] for length extrapolation.

LongBench & SCROLLS

On the LongBench [11] and SCROLLS [16] real-world long-document benchmarks (multi-document QA, single-document QA, summarization), LUCID and LUCID-PaTH achieve the best performance on 4 of 6 tasks. Linear attention variants (DeltaNet, GLA, GSA) underperform substantially — e.g., DeltaNet achieves only 0.036 F1 on 2WikiMQA compared to 0.274 for LUCID.

Attention Hitrate

To directly measure whether LUCID reduces attention noise, we measure the attention hitrate — the fraction of attention weight placed on semantically relevant ("needle") tokens during retrieval:

Standard Attention: 0.1817
LUCID Attention: 0.2845 — a 56.6% improvement

This directly confirms that LUCID's preconditioner concentrates probability mass on the right tokens.
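For a single query, the hitrate and the quoted relative improvement can be computed as follows. This is a sketch of the definition only; how the metric is aggregated over heads, layers, and queries is not specified here, and the helper names are hypothetical.

```python
def attention_hitrate(weights, needle_positions):
    """Fraction of one query's attention mass that falls on needle tokens."""
    total = sum(weights)
    hit = sum(weights[i] for i in set(needle_positions))
    return hit / total

def relative_improvement(baseline, new):
    """(new - baseline) / baseline."""
    return (new - baseline) / baseline

# Toy attention row over four tokens, with needles at positions 1 and 2.
toy_hitrate = attention_hitrate([0.1, 0.6, 0.2, 0.1], [1, 2])

# Relative improvement implied by the two reported hitrates.
reported_gain = relative_improvement(0.1817, 0.2845)
```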

LUCID is a drop-in replacement with gains that scale with context length. Up to 14% improvement on RULER multi-needle (47% relative improvement at 64K finetuning), stable accuracy on BABILong at 128K where baselines collapse, and 56.6% better attention hitrate — all with ~1.3% inference overhead.

Want to Work Further on LUCID?

(a) Distillation for Foundation Models

Our experiments train from scratch at ~1B scale. A natural next step is to use distillation to create LUCID-PaTH variants of existing foundation models (Qwen, Gemma, Llama, etc.) for ~1M context, where $\kappa$ blows up and LUCID's correction becomes most essential. This would bring LUCID's benefits to production-scale models without full pretraining.

(b) Efficient Kernels

The current implementation uses the cuBLAS TRSM kernel for the triangular solve. Two promising directions for further optimization:

Interleaved solver + FlashAttention: Fuse the triangular solve with the FlashAttention softmax pass into a single kernel to reduce memory traffic.
Neumann series approximation: Write the inverse as $(I - A)^{-1} = I + A + A^2 + \ldots$ and compute the truncated series using FlashAttention-like tiled algorithms. This could enable fully IO-aware implementations.
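A dense sketch of the Neumann idea, without any tiling or fusion. For a unit lower-triangular $P$, the matrix $A = I - P$ is strictly lower triangular and therefore nilpotent, so the series terminates and $N$ terms are exact; how well a shorter truncation works depends on the size of the off-diagonal entries. The random test matrix and function name are assumptions.

```python
import numpy as np

def neumann_inverse(P, num_terms):
    """Approximate P^{-1} = (I - A)^{-1} by I + A + ... + A^(num_terms - 1), A = I - P."""
    N = P.shape[0]
    A = np.eye(N) - P
    approx = np.eye(N)
    power = np.eye(N)
    for _ in range(num_terms - 1):
        power = power @ A                # next power of A
        approx = approx + power
    return approx

rng = np.random.default_rng(0)
N = 6
# Unit lower-triangular stand-in for the preconditioner.
P = np.eye(N) + np.tril(rng.uniform(0.0, 0.3, size=(N, N)), k=-1)

exact = np.linalg.inv(P)
err_2_terms = np.abs(neumann_inverse(P, 2) - exact).max()
err_N_terms = np.abs(neumann_inverse(P, N) - exact).max()
```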

(c) Bidirectional Models

LUCID's efficient implementation relies on the preconditioner being lower triangular (from the causal mask). For bidirectional settings — such as diffusion models or encoders — the preconditioner loses this structure. One approach: use Newton-Schulz iteration (as in the Muon optimizer) to approximate the inverse, which primarily involves bf16 matrix multiplications and could be made efficient on modern hardware.
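One possible reading of this direction is the classical Newton-Schulz iteration for a matrix inverse, $X_{k+1} = X_k(2I - AX_k)$, which uses only matrix products. The sketch below is an assumption about which variant is meant; it runs in float64 on a small unmasked preconditioner rather than in bf16, and the iteration count is chosen generously.

```python
import numpy as np

def newton_schulz_inverse(A, num_iters=30):
    """Approximate A^{-1} with X <- X (2I - A X), using only matrix products."""
    N = A.shape[0]
    # Classical starting point that makes the iteration converge for nonsingular A.
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    two_I = 2.0 * np.eye(N)
    for _ in range(num_iters):
        X = X @ (two_I - A @ X)
    return X

rng = np.random.default_rng(0)
N, d = 6, 8
K = rng.standard_normal((N, d))
K_rn = np.sqrt(d) * K / np.linalg.norm(K, axis=1, keepdims=True)
P_bi = np.exp(K_rn @ K_rn.T / np.sqrt(d) - np.sqrt(d))   # no causal mask: full symmetric matrix

X = newton_schulz_inverse(P_bi)
```

The number of iterations needed grows with the condition number of the matrix, so whether this is practical at long context in low precision remains an open question.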

Key open directions: distilling LUCID into foundation models for ~1M context, fusing the triangular solver with FlashAttention kernels, and Newton-Schulz iteration for bidirectional settings.

References

[1] Vaswani et al. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762
[2] Katharopoulos et al. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020. arXiv:2006.16236
[3] Yang et al. "Parallelizing Linear Transformers with the Delta Rule over Sequence Length." 2024. arXiv:2406.06484
[4] Schlag et al. "Linear Transformers Are Secretly Fast Weight Programmers." ICML 2021. arXiv:2102.11174
[5] Yang et al. "Gated Delta Networks: Improving Mamba2 with Delta Rule." 2024. arXiv:2412.06464
[6] Ye et al. "Differential Transformer." 2025. arXiv:2410.05258
[7] Yang et al. "PaTH Attention: Position Encoding via Accumulating Householder Transformations." 2025. arXiv:2505.16381
[8] Zhai et al. "Stabilizing Transformer Training by Preventing Attention Entropy Collapse." ICML 2023. arXiv:2303.06296
[9] Masarczyk et al. "Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization." 2025. arXiv:2506.01562
[10] Hsieh et al. "RULER: What's the Real Context Size of Your Long-Context Language Models?" 2024. arXiv:2404.06654
[11] Bai et al. "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding." 2023. arXiv:2308.14508
[12] Brown et al. "Language Models are Few-Shot Learners." NeurIPS 2020.
[13] Kuratov et al. "BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack." NeurIPS 2024. arXiv:2406.10149
[14] Dao & Gu. "Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality." 2024. arXiv:2405.21060
[15] Weston et al. "Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks." 2015. arXiv:1502.05698
[16] Shaham et al. "SCROLLS: Standardized CompaRison Over Long Language Sequences." 2022. arXiv:2201.03533
[17] Soldaini et al. "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." 2024.
[18] Team et al. "Gemma 3 Technical Report." 2025. arXiv:2503.19786
[19] Team et al. "Qwen2 Technical Report." 2024. arXiv:2407.10671
[20] Su et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." Neurocomputing 2024.
[21] Gao et al. "The Language Model Evaluation Harness." 2024. Zenodo
[22] Zhang et al. "Gated Slot Attention for Efficient Linear-Time Sequence Modeling." 2024. arXiv:2409.07146