mHC: Manifold-Constrained Hyper-Connections
arXiv and excellent video explainer from Jia-Bin Huang. Summary: This paper extends the [[Hyper-Connections|Hyper Connections]] paper. Hyper connections were introduced as a way to essentially apply the idea of a mixture of experts (MoE) to residual connections. Instead of having a single residual connection, let’s have multiple parallel residual connections, each with a learned linear mapping to “extract” different parts of the main layer. Manifold-Constrained Hyper Connections improves on this by making training more stable. They do this by constraining the linear mapping that mixes the residual connection “heads” to be a doubly stochastic matrix: every entry is non-negative and every row and every column sums to 1. Each output stream is then a weighted average of the input streams, so the mixing cannot amplify the signal and it preserves the average across streams. A product of doubly stochastic matrices is still doubly stochastic, so this holds no matter how many layers are stacked, which should prevent exploding or vanishing gradients. ...
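
To make the doubly stochastic constraint concrete, here is a minimal sketch in plain Python. The summary above does not say how the constraint is enforced, so this assumes Sinkhorn-Knopp normalization (alternately rescaling rows and columns of a positive matrix), which is a standard way to get an approximately doubly stochastic matrix. The names `sinkhorn_knopp`, `random_mixing_matrix`, and `n_streams`, the choice of 4 streams, and the 20-layer stack are all hypothetical and only for illustration.

```python
import math
import random


def sinkhorn_knopp(logits, n_iters=50):
    """Map a square matrix of raw scores to an approximately doubly stochastic matrix."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    mat = [[math.exp(x) for x in row] for row in logits]
    for _ in range(n_iters):
        # Rescale each row so it sums to 1.
        row_sums = [sum(row) for row in mat]
        mat = [[mat[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Rescale each column so it sums to 1.
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


def matmul(a, b):
    """Plain square matrix product, used to compose mixing matrices across layers."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


random.seed(0)
n_streams = 4  # hypothetical number of residual streams ("heads")


def random_mixing_matrix():
    """Stand-in for a learned mixing matrix: random scores pushed through Sinkhorn-Knopp."""
    logits = [[random.uniform(-1.0, 1.0) for _ in range(n_streams)] for _ in range(n_streams)]
    return sinkhorn_knopp(logits)


H = random_mixing_matrix()
row_sums = [sum(row) for row in H]
col_sums = [sum(H[i][j] for i in range(n_streams)) for j in range(n_streams)]

# Compose 20 layers' worth of mixing matrices; the product stays doubly stochastic.
product = H
for _ in range(19):
    product = matmul(random_mixing_matrix(), product)
product_row_sums = [sum(row) for row in product]
product_col_sums = [sum(product[i][j] for i in range(n_streams)) for j in range(n_streams)]

print("row sums:", row_sums)
print("col sums:", col_sums)
print("row sums after 20 layers:", product_row_sums)
```

The row and column sums should all come out very close to 1, both for a single matrix and for the 20-layer product. That closure under multiplication is the property the stability argument relies on; an unconstrained mixing matrix has no such guarantee, so its products can grow or shrink with depth.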