Researchers from Stanford University have published a paper introducing Linear Recurrence Networks (LRNs), an architecture that replaces self-attention, the quadratic-cost mechanism at the core of most modern language models, with a linear recurrence that runs in O(n) time.
The result, according to the paper, is models that reach 95% of transformer quality while using 80% less memory during inference and 60% less during training. The paper demonstrates a 70B-parameter LRN processing 1M tokens in a single forward pass on a single A100 GPU, a workload that would require eight A100s with a standard transformer.
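To see why context length dominates inference memory in a standard transformer, it helps to estimate the size of the key-value cache, which stores one key and one value vector per layer, per attention head, per token. The sketch below is a back-of-the-envelope calculation only. The layer count, number of key-value heads, head dimension, and 16-bit precision are assumed values for a hypothetical 70B-class model and do not come from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one key vector and one value vector per layer, head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical 70B-class configuration (assumed for illustration, not from the paper).
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for n in (32_000, 128_000, 1_000_000):
    gb = kv_cache_bytes(n, **cfg) / 1e9
    print(f"{n:>9,} tokens -> {gb:7.1f} GB of KV cache")
```

Under these assumptions the cache grows in direct proportion to context length and reaches roughly 328 GB at 1M tokens, before counting the model weights. That is far more than the 80 GB available on the largest A100, which is the kind of pressure a compressed recurrent state is meant to relieve.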
How It Works
LRNs replace the self-attention layer with a structured linear recurrence that maintains a compressed “memory state” as it processes each token. This state grows logarithmically with sequence length, rather than linearly as a transformer's key-value cache does, so the model can refer back to earlier parts of a long context without the O(n²) cost of comparing every token with every other token.
The key insight is a “selective recurrence” mechanism that learns which information to keep in the compressed state and which to discard. This is analogous to how attention learns which tokens to focus on, but at a fixed computational cost.
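The general idea of a selective recurrence can be illustrated with a gated linear recurrence, in which an input-dependent gate decides how much of the previous state to keep at each step. The sketch below is a minimal illustration of that idea and not the paper's actual update rule, which this article does not spell out. The function name, the per-channel sigmoid gate, and the fixed-size state are all assumptions made for simplicity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def selective_recurrence(tokens, gate_weights, gate_bias):
    """Run a gated linear recurrence over a sequence of feature vectors.

    tokens:       list of length-d lists, one per token
    gate_weights: length-d list of (hypothetical) learned gate weights
    gate_bias:    length-d list of (hypothetical) learned gate biases
    Returns the list of states, one per token.
    """
    d = len(gate_weights)
    state = [0.0] * d  # fixed-size state, an assumption of this sketch
    states = []
    for x in tokens:
        # Input-dependent gate in (0, 1): near 1 keeps the old state,
        # near 0 overwrites it with the current token.
        keep = [sigmoid(gate_weights[i] * x[i] + gate_bias[i]) for i in range(d)]
        # The update is linear in the previous state; each step costs O(d),
        # so a sequence of n tokens costs O(n * d) overall.
        state = [keep[i] * state[i] + (1.0 - keep[i]) * x[i] for i in range(d)]
        states.append(state)
    return states


# Toy usage: three 2-dimensional "tokens" with an untrained gate (always 0.5).
demo_tokens = [[1.0, 0.5], [0.0, -1.0], [2.0, 0.0]]
demo_states = selective_recurrence(demo_tokens, [0.0, 0.0], [0.0, 0.0])
print(demo_states[-1])
```

Each token is visited once and touches only the running state, so the work per token does not depend on how many tokens came before. In a trained model the gate parameters would be learned, which is what would let the recurrence decide what to retain and what to forget.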
Benchmark Results
On standard language modeling benchmarks, 70B LRN models come within 1-2% of 70B transformer baselines on MMLU, HumanEval, and GSM8K. On long-context tasks (more than 32K tokens), LRNs outperform transformers by 3-5%, likely because the recurrence mechanism handles long-range dependencies more naturally.
Open Source
The researchers have released training code, model weights for several sizes (1B, 7B, 13B), and evaluation scripts on GitHub under the Apache 2.0 license. Several AI labs, including Mistral and Nous Research, have already begun experimenting with the architecture.