Original Paper

Summary:

This paper views attention as:

a particular instance of tensor lifting: a hidden vector is mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent.

As an alternative, they propose “an attention-free sequence model built around Grassmann flows.”

Instead of lifting from token space to pairwise-interaction space, consider lifting to a Grassmann manifold. The hidden states become points on this manifold, and each forward pass traces a path across it. That path is a flow, and the flow is what gets learned.
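
In symbols (my own reading of the setup, not notation taken from the paper), the lift is something like

$$
Z_t \in \mathbb{R}^d \;\longmapsto\; P_{t,\Delta} = \operatorname{span}\left(Z_t,\, Z_{t+\Delta}\right) \in \mathrm{Gr}(2, d),
$$

so a forward pass traces the path $t \mapsto P_{t,\Delta}$ across the Grassmannian $\mathrm{Gr}(2, d)$ of 2D subspaces of $\mathbb{R}^d$, and that path is the flow being learned.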

Given vectors $Z_t$ and $Z_{t+\Delta}$, consider the 2D subspace they span. Project this subspace back to the model dimension and aggregate (average) across offsets $\Delta$. Then concatenate the original hidden state with the resulting “Grassmann feature”, “compute a gate” (apply a linear layer), and follow with layer norm and dropout.
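
To make sure I understood the recipe, here is a minimal PyTorch sketch of my reading of it. To be clear, this is not the paper’s code: the module name (`GrassmannMixer`), the offset set, the QR-based projector representation of the 2D subspace, and the residual gating at the end are all my own assumptions.

```python
# Minimal sketch (my reading, not the paper's code) of the Grassmann feature block.
import torch
import torch.nn as nn


class GrassmannMixer(nn.Module):
    """Hypothetical attention-free mixing block built from pairwise 2D subspaces."""

    def __init__(self, d_model: int, offsets=(1, 2, 4), dropout: float = 0.1):
        super().__init__()
        self.offsets = offsets                       # assumed offset set for Delta
        self.gate = nn.Linear(2 * d_model, d_model)  # gate from [z ; Grassmann feature]
        self.norm = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, seq_len, d_model)
        B, T, D = z.shape
        feats = []
        for delta in self.offsets:
            # Pair z_t with z_{t+delta}; pad the tail by repeating the last state.
            z_shift = torch.cat([z[:, delta:], z[:, -1:].expand(B, delta, D)], dim=1)
            # Orthonormal basis Q for span(z_t, z_{t+delta}) -- a point on Gr(2, D).
            pair = torch.stack([z, z_shift], dim=-1)   # (B, T, D, 2)
            Q, _ = torch.linalg.qr(pair)               # (B, T, D, 2)
            # Map back to model dimension by projecting z_t onto the subspace,
            # i.e. Q Q^T z_t (one possible choice; the paper may do this differently).
            proj = Q @ (Q.transpose(-1, -2) @ z.unsqueeze(-1))
            feats.append(proj.squeeze(-1))             # (B, T, D)
        # Aggregate (average) across offsets.
        g = torch.stack(feats, dim=0).mean(dim=0)
        # Concatenate hidden state and Grassmann feature, compute a gate,
        # then layer norm and dropout. The residual gating here is an assumption.
        gate = torch.sigmoid(self.gate(torch.cat([z, g], dim=-1)))
        return self.drop(self.norm(z + gate * g))
```

The QR step is just one convenient way to turn the pair $(Z_t, Z_{t+\Delta})$ into an orthonormal basis, and the projector $QQ^\top$ is the standard trick for identifying a point on the Grassmannian with a $d \times d$ matrix, which is what makes “project back to the model dimension” concrete.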

Thoughts:

To me, this is one of the most exciting papers in recent memory. It relates very closely to a lot of the ideas I’ve been floating around about using multivectors and geometric algebra, instead of being constrained to simple vectors as is standard practice.

In a (very) abstract way, I think conceptualizing language as a flow/path in the space of token pairs also lends itself, if only slightly, to the classic metaphor: “the flow of conversation.”

I would be very curious to see this idea expanded to higher levels of abstraction beyond token pairs. For example, it would be very interesting to try to model the flow of a paragraph in sentence-pair space, compared to the flow of a sentence in token-pair space.

Maybe these could both be modeled as a single flow in a higher-dimensional space, since if a sentence is a flow of tokens, and a paragraph is a flow of sentences, then really a paragraph is also a flow of tokens. Still, it would be useful to “visualize” or operate at different granularities of this flow.¹

Looking forward to further research in this area, and to seeing what other people might do with it.


  1. This actually reminds me a little of the Coastline Paradox, though I don’t think it’s particularly useful here. ↩︎