06-30-2026
This work was funded by Paradigm.
Transformers exhibit remarkable associative recall (AR) capabilities: attention gives each token direct access to its predecessors, a mechanism that has been hard to match for other architectures, such as recurrent neural networks (RNNs).
But for some domains, we cannot afford the quadratic attention overhead of Transformers. An example is long-horizon RL in the style of Dreamer. For these types of applications, we need to make recurrent neural networks work, but do not want to sacrifice associative recall.
The best-known RNN for associative recall is mLSTM, a variant of LSTM that maintains a matrix memory. mLSTMs show significantly better recall than baselines on the MQAR benchmark. But clean recall alone may not be a sufficient measure of recall performance. In domains where observations of the environment may be noisy, a useful proxy test is noisy associative recall (NAR).
Since MQAR does not measure NAR, we can look at MAD’s noisy AR task suite. Here’s an example of what a task looks like:
0 9 3 10 12 13 15 14 0 9 5 8 2 9
Here, key 0 maps to value 9, key 3 maps to value 10, and so on. The MAD generator uses separate token ranges for keys, values, and distractors, so if the keys are 0-5, then tokens 12-15 are distractors. A good NAR model should predict 9 at the 10th position, having seen 0 -> 9 earlier, while ignoring the interleaved distractor tokens.
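To make the task format concrete, here is a minimal sketch that walks the example sequence and lists the positions where a model is expected to recall a value. It is a simplified reading of the format described above, not MAD's own generator code; the function name `recall_targets` and the exact token ranges are assumptions for illustration.

```python
def recall_targets(tokens, keys, noise):
    """Return (index, expected_value) pairs for repeated keys.

    Assumes the simplified layout described in the text: a key is always
    followed by its value, and distractor tokens appear between pairs.
    """
    seen = {}      # key -> value recorded at its first occurrence
    targets = []   # 0-based positions where the value must be recalled
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in keys and i + 1 < len(tokens):
            if tok in seen:
                # Repeated key: the next position is a recall target.
                targets.append((i + 1, seen[tok]))
            else:
                seen[tok] = tokens[i + 1]
            i += 2  # skip over the value token
        else:
            # Distractor (or any unexpected token): carries no binding.
            i += 1
    return targets


example = [0, 9, 3, 10, 12, 13, 15, 14, 0, 9, 5, 8, 2, 9]
# Keys 0-5 and distractors 12-15, as in the example above.
targets = recall_targets(example, keys=range(0, 6), noise=range(12, 16))
print(targets)  # [(9, 9)]: index 9 (the 10th token) should be 9
```

Only the second occurrence of key 0 produces a target, because keys 3, 5, and 2 each appear once in this sample.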
So how do we improve NAR in recurrent models? We can borrow some ideas from Muon, an optimizer that has been highly successful for language modeling. Muon orthogonalizes its momentum, which acts as an equalizer over the directions represented in the update. This prevents a few strong directions from dominating the update, and boosts weaker directions. Particularly relevant is recent research showing that Muon outperforms Adam at learning tail-end associative memories. The idea is that this equalization prevents weaker memories from being drowned out.
Motivated by this, we decided to test whether orthogonalizing the mLSTM memory matrix at readout, and training through this extra step, improves NAR performance.
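The "equalizer" picture can be written down directly. For a momentum matrix with a singular value decomposition, exact orthogonalization replaces every nonzero singular value by 1, and Muon approximates this map with a few Newton-Schulz iterations. The symbols $M$, $U$, $\Sigma$, $V$ are notation introduced here for illustration:

$$
M = U \Sigma V^\top \;\longmapsto\; \mathrm{Orth}(M) = U V^\top
$$

A direction with singular value 0.01 and one with singular value 10 therefore contribute equally after the map.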
We compare mLSTM baselines to their orthogonalized variants on next-token prediction using MAD noisy AR samples. For both training and evaluation we use MAD noisy-recall with frac_noise set to 0.8, across a range of vocabulary sizes and sequence lengths. All models were trained using AdamW (betas = 0.9, 0.999, weight_decay = 0.01) for 2k steps at a batch size of 64. The learning rate was chosen by sweeping 3e-4, 1e-3, 3e-3, and 1e-2 for each task setup.
We sample a fresh batch for training at each step, and maintain a separate fixed validation set per experiment. For orthogonalization, we normalize by the Frobenius norm (eps = 1e-6) and apply five Newton-Schulz iterations. We allow gradients to flow through this process. Importantly, we do not write the orthogonalized memory back into the recurrent state, because we found this gives poor performance. We use it only for readout. Fully reproducible code for our experiments can be found here.
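The setup above can be summarized as a small configuration sketch. The optimizer and training values are the ones stated in the text; the task grid (vocab 80/96, sequence length 768/1024) contains only the values that show up in the results and may not be the full range, and picking the learning rate by validation accuracy is an assumption, as are all names used here.

```python
from itertools import product

# Values stated in the text.
ADAMW = {"betas": (0.9, 0.999), "weight_decay": 0.01}
TRAIN = {"steps": 2000, "batch_size": 64, "frac_noise": 0.8}
LEARNING_RATES = (3e-4, 1e-3, 3e-3, 1e-2)

# Assumed task grid: only the values that appear in the reported results.
VOCAB_SIZES = (80, 96)
SEQ_LENS = (768, 1024)


def sweep_configs():
    """Yield one run config per (task setup, learning rate) combination."""
    for vocab, seq_len, lr in product(VOCAB_SIZES, SEQ_LENS, LEARNING_RATES):
        yield {"vocab_size": vocab, "seq_len": seq_len, "lr": lr, **ADAMW, **TRAIN}


def pick_lr(val_acc_by_lr):
    """Pick the learning rate with the best validation accuracy (assumed criterion)."""
    return max(val_acc_by_lr, key=val_acc_by_lr.get)
```

For example, `pick_lr({3e-4: 0.52, 1e-3: 0.91, 3e-3: 0.88, 1e-2: 0.10})` selects `1e-3` for that task setup.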
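A minimal NumPy sketch of the readout-only orthogonalization described above. It uses the classic cubic Newton-Schulz step, which is an assumption: the text does not say which polynomial is used, and Muon-style implementations often use a tuned quintic instead. The readout is also simplified to a plain matrix-vector product, and NumPy does not track gradients, so this shows only the forward computation. The function names are hypothetical.

```python
import numpy as np


def newton_schulz_orthogonalize(C, steps=5, eps=1e-6):
    """Push the singular values of C toward 1 with Newton-Schulz iterations."""
    # Frobenius normalization puts every singular value in (0, 1],
    # which lies inside the convergence region of the iteration below.
    X = C / (np.linalg.norm(C) + eps)
    for _ in range(steps):
        # Cubic Newton-Schulz step (assumed variant): s -> 1.5*s - 0.5*s**3
        # on each singular value, which moves it toward 1.
        X = 1.5 * X - 0.5 * (X @ X.T @ X)
    return X


def readout(C, q):
    """Read from an orthogonalized copy of the memory; C itself is not modified."""
    return newton_schulz_orthogonalize(C) @ q


# Toy memory with one strong and one weak direction (singular values 3 and 1).
C = np.diag([3.0, 1.0])
q = np.array([0.0, 1.0])  # query aligned with the weak direction

raw = C @ q / np.linalg.norm(C)  # readout from the merely normalized memory
equalized = readout(C, q)        # readout after orthogonalization
```

In this toy case the weak direction's response grows from roughly 0.32 to nearly 1, while the stored matrix `C` stays unchanged, which mirrors the choice not to write the orthogonalized memory back.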


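$$
\begin{array}{lccc}
\hline
\text{Task} & \text{Orthogonalized acc. (solved seeds)} & \text{Baseline acc. (solved seeds)} & \Delta \\
\hline
\text{Vocab 80, seq len 768} & 91.7 \pm 11.4\ (22/24) & 75.9 \pm 12.0\ (13/24) & +15.7 \pm 16.8 \\
\text{Vocab 80, seq len 1024} & 98.5 \pm 2.4\ (23/24) & 83.3 \pm \ldots & \ldots \\
\vdots & \vdots & \vdots & \vdots \\
\text{Vocab 96, seq len 1024} & 68.5 \pm 18.3\ (16/24) & 23.1 \pm 15.3\ (4/24) & +45.4 \pm 18.6 \\
\hline
\end{array}
$$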
We found that orthogonalization improves success rates and mean accuracy across the board. Interestingly, the gap widens as we enter the Vocab-96 regime, suggesting that orthogonalization helps most on difficult NAR tasks where raw mLSTMs struggle. In the latter two cases (Vocab 96, seq len 768/1024), orthogonalization brings mLSTM from the brink of failure (4/24 solved seeds) to substantially more reliable performance (14–16 solved seeds). This is a striking effect for such a small intervention. Newton-Schulz gives us additional gains at a fixed parameter count, at the cost of extra FLOPs and wall-clock time.
We should be careful not to read too much into these results. They live in a small-model regime, and NAR is a synthetic task. It would be worth investigating whether the NAR advantage translates into gains on real-world benchmarks for larger models.
Thanks to Dan Robinson, Alpin Yukseloglou, and Glenn Taggart for feedback and suggestions while writing this post.