Transformer tricks: Removing weights for skipless transformers
Nils Graef
OpenMachine

Corresponding Author: [email protected]

Abstract

He and Hofmann [1] detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme is only applicable to MHA (multi-head attention) [2], not to MQA (multi-query attention) [3] or GQA (grouped-query attention) [4]. The latter schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma [5, 6, 7, 8, 9]. Therefore, this micro-paper [10] proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity). See [11, 12] for code and more transformer tricks.
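
As a rough sanity check of the 15% figure, the sketch below counts Mistral-7B's weights from its publicly documented configuration (hidden size 4096, 32 layers, 32 query heads, 8 key/value heads of dimension 128, MLP intermediate size 14336, vocabulary 32000, untied embeddings). It only reproduces the parameter arithmetic, not the skipless variant itself; see [11, 12] for the actual code.

    # Back-of-envelope parameter count for Mistral-7B, assuming the
    # publicly documented model configuration (values below).
    d_model  = 4096     # hidden (embedding) dimension
    n_layers = 32       # number of transformer layers
    n_heads  = 32       # query heads
    n_kv     = 8        # key/value heads (GQA)
    d_head   = 128      # dimension per head
    d_ffn    = 14336    # MLP intermediate dimension
    vocab    = 32000    # vocabulary size

    # Per-layer weight matrices (biases and norm weights are negligible)
    q_proj = d_model * n_heads * d_head     # Q
    k_proj = d_model * n_kv * d_head        # K
    v_proj = d_model * n_kv * d_head        # V
    p_proj = n_heads * d_head * d_model     # P (post-attention projection)
    mlp    = 3 * d_model * d_ffn            # gate, up, and down projections

    per_layer  = q_proj + k_proj + v_proj + p_proj + mlp
    embeddings = 2 * vocab * d_model        # input embedding + LM head

    total   = n_layers * per_layer + embeddings
    removed = n_layers * (q_proj + p_proj)  # drop Q and P in every layer

    print(f"total parameters  : {total / 1e9:.2f} B")   # ~7.24 B
    print(f"removed (Q and P) : {removed / 1e9:.2f} B") # ~1.07 B
    print(f"relative reduction: {100 * removed / total:.1f} %")  # ~14.8 %

Dropping Q and P in all 32 layers removes roughly 1.07 B of the 7.24 B parameters, which rounds to the 15% quoted above.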
Submitted to TechRxiv: 21 Mar 2024
Published in TechRxiv: 29 Mar 2024