mHC: Manifold-Constrained Hyper-Connections

ArXiv and an excellent video explainer from Jia-Bin Huang. Summary: This paper builds on the [[Hyper-Connections|Hyper Connections]] paper. Hyper connections were introduced as a way to essentially apply the idea of a mixture of experts (MoE) to residual connections: instead of a single residual connection, use multiple residual connections, each with a linear mapping that “extracts” different parts of the main layer. Manifold-Constrained Hyper-Connections improves on this by addressing training stability. It does so by constraining the linear map applied to the residual-connection “heads” to be a doubly stochastic matrix, so that each row and column sums to 1, which should prevent exploding or vanishing gradients. ...
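A doubly stochastic matrix has nonnegative entries with every row and every column summing to 1. As a minimal sketch of how such a constraint can be enforced, here is Sinkhorn-Knopp normalization; this particular parameterization is my assumption, not necessarily the paper's exact method:

```python
import numpy as np

def sinkhorn(logits: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Alternately normalize rows and columns so the matrix becomes
    (approximately) doubly stochastic."""
    m = np.exp(logits - logits.max())             # positive entries, numerically stable
    for _ in range(n_iters):
        m = m / m.sum(axis=1, keepdims=True)      # rows sum to 1
        m = m / m.sum(axis=0, keepdims=True)      # columns sum to 1
    return m

# Hypothetical mixing matrix over 4 residual "heads": rows and columns each
# sum to ~1, so the residual stream is neither amplified nor attenuated on average.
rng = np.random.default_rng(0)
H = sinkhorn(rng.normal(size=(4, 4)))
print(H.sum(axis=0))  # ~ [1, 1, 1, 1]
print(H.sum(axis=1))  # ~ [1, 1, 1, 1]
```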

January 9, 2026 · 2 min · Me

Attention is Not What You Need

Original Paper Summary: This paper views attention as a particular instance of tensor lifting: a hidden vector is mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. As an alternative, the authors propose “an attention-free sequence model built around Grassmann flows.” Instead of lifting from token space to pairwise-interaction space, consider lifting to a Grassmann manifold. The hidden states are points on this manifold, and each forward pass traces a path on it. This path is a flow that we can learn. ...
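As a rough illustration of the "lifting to a Grassmann manifold" idea (this construction is my own sketch for intuition, not the paper's actual model): a point on Gr(k, d) can be represented by a d × k matrix with orthonormal columns, and one flow step can move along a tangent direction and retract back onto the manifold via QR.

```python
import numpy as np

def lift_to_grassmann(h, maps):
    """Lift a hidden vector h (shape (d,)) to a point on Gr(k, d): stack k
    learned linear views of h into a d x k matrix and orthonormalize it."""
    M = np.stack([W @ h for W in maps], axis=1)   # d x k
    Q, _ = np.linalg.qr(M)                        # orthonormal basis of the subspace
    return Q

def flow_step(Q, A, step=0.1):
    """One step of a (hypothetical) learned flow: project A onto the tangent
    space at Q, move along it, and retract onto the manifold via QR."""
    A_tan = A - Q @ (Q.T @ A)                     # tangent component of A
    Q_next, _ = np.linalg.qr(Q + step * A_tan)    # retraction back onto Gr(k, d)
    return Q_next

rng = np.random.default_rng(0)
d, k = 16, 4
h = rng.normal(size=d)
maps = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(k)]
Q = lift_to_grassmann(h, maps)
Q = flow_step(Q, rng.normal(size=(d, k)))
print(Q.shape, np.allclose(Q.T @ Q, np.eye(k), atol=1e-6))  # columns stay orthonormal
```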

January 2, 2026 · 2 min · Me

Proposal for a System 2 Model

Introduction I recently read this tweet, which referenced the LLaDA paper (Large Language Diffusion Models). And with Inception’s recent announcement of their Mercury models, I’ve been thinking a lot about diffusion architectures for large language models, and I’m really excited about them. I harbor very similar sentiments: diffusion feels far more analogous to how humans think than the transformer-based, token-by-token autoregressive models (ARMs) that currently dominate the field. Instead of simply predicting what the next token should be, we humans ideate some concept as a whole and then put it into words (or some other medium). A diffusion model is much closer to this, as it effectively generates all the tokens together in parallel instead of one at a time. ...
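To make the contrast concrete, here is a toy sketch of the two decoding patterns (my own simplification with stand-in models, not LLaDA's actual sampling procedure): autoregressive decoding commits one token per step, while diffusion-style decoding starts fully masked and refines every position in parallel across a few steps.

```python
import random

random.seed(0)
VOCAB = ["the", "cat", "sat", "on", "a", "mat"]

def ar_model(prefix):
    """Stand-in for a next-token predictor conditioned on the prefix."""
    return random.choice(VOCAB)

def denoise_model(tokens, i):
    """Stand-in for a masked-token predictor that sees the whole sequence."""
    return random.choice(VOCAB)

def autoregressive_decode(length):
    """One token at a time, each conditioned only on what came before."""
    tokens = []
    for _ in range(length):
        tokens.append(ar_model(tokens))
    return tokens

def diffusion_style_decode(length, steps=3):
    """Start fully masked; predict all positions in parallel each step and
    commit a growing fraction of them."""
    tokens = ["<mask>"] * length
    for step in range(1, steps + 1):
        proposals = [denoise_model(tokens, i) for i in range(length)]  # all positions at once
        n_commit = int(length * step / steps)
        tokens[:n_commit] = proposals[:n_commit]
    return tokens

print(autoregressive_decode(6))
print(diffusion_style_decode(6))
```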

February 21, 2025 · 5 min · Me

Tversky Neural Networks

ArXiv Summary: Amos Tversky’s seminal paper Features of Similarity introduced a metric for how humans perceive similarity. Tversky’s similarity metric is non-symmetric, as human similarity judgments empirically are. Using this non-symmetric metric in neural networks makes it easier to learn non-linear functions such as XOR. The primary similarity function they introduce is: $$ S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A) $$ There are several parts to this. ...
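A minimal sketch of the contrast model on sets of binary features, with θ, α, β as free parameters; taking f to be plain set cardinality is my simplifying assumption (in a learned setting f would be a salience measure):

```python
def tversky_similarity(A: set, B: set, theta=1.0, alpha=0.5, beta=0.5) -> float:
    """Tversky's contrast model: S(a, b) = theta*f(A∩B) - alpha*f(A−B) - beta*f(B−A),
    here with f taken to be simple set cardinality."""
    return (theta * len(A & B)
            - alpha * len(A - B)
            - beta * len(B - A))

# Asymmetry: with alpha != beta, S(a, b) != S(b, a) in general.
a = {"country", "asian", "peninsula"}
b = {"country", "asian", "large", "populous"}
print(tversky_similarity(a, b, alpha=0.8, beta=0.2))  # a compared to b
print(tversky_similarity(b, a, alpha=0.8, beta=0.2))  # b compared to a
```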

3 min · Me