Mixture of A Million Experts // new (old) idea of many specialized models

sbagency
3 min read · Jul 16, 2024

https://arxiv.org/pdf/2407.04153

The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
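
A minimal sketch of the product key technique, assuming a single query vector and randomly initialized sub-keys; the function name product_key_topk and all sizes here are illustrative, not the paper's implementation. The point it shows: N = n² experts can be addressed by scoring the query against only 2·n sub-keys.

```python
# Sketch of product-key retrieval (Lample et al., 2019), as used by PEER to route
# over N = n^2 experts while only comparing against 2n sub-keys.
# Variable names and shapes are illustrative, not the paper's exact implementation.
import torch

def product_key_topk(query, sub_keys_1, sub_keys_2, k):
    """query: (d,); sub_keys_1, sub_keys_2: (n, d/2); returns top-k expert ids and scores."""
    d = query.shape[0]
    q1, q2 = query[: d // 2], query[d // 2 :]           # split the query into two halves

    s1 = sub_keys_1 @ q1                                # (n,) scores against first sub-key set
    s2 = sub_keys_2 @ q2                                # (n,) scores against second sub-key set

    top1_val, top1_idx = s1.topk(k)                     # k best rows
    top2_val, top2_idx = s2.topk(k)                     # k best columns

    # Candidate experts are the Cartesian product of the two top-k sets: k^2 candidates
    # with score s1[i] + s2[j]; the exact top-k over all n^2 experts is guaranteed
    # to be among them, because scores are additive over the two halves.
    cand_scores = top1_val[:, None] + top2_val[None, :]            # (k, k)
    cand_ids = top1_idx[:, None] * sub_keys_2.shape[0] + top2_idx  # flat expert index i*n + j

    best_val, best_pos = cand_scores.flatten().topk(k)
    return cand_ids.flatten()[best_pos], best_val

# Example: n = 1024 sub-keys per set -> n^2 = 1,048,576 addressable experts.
n, d, k = 1024, 256, 16
expert_ids, scores = product_key_topk(torch.randn(d), torch.randn(n, d // 2), torch.randn(n, d // 2), k)
```

The additive score structure is what makes the exact top-k recoverable from only k² candidates instead of a full scan over all n² experts.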

This work introduces the Parameter Efficient Expert Retrieval (PEER) architecture, leveraging product key retrieval (Lample et al., 2019) for efficient routing to an extremely large number of experts, decoupling computational cost from parameter count. This design demonstrates a superior compute-performance tradeoff in our experiments, positioning it as a competitive alternative to dense FFW layers for scaling foundation models. The main contributions of this work are:

• Exploration of Extreme MoE Setting: Deviating from the focus on a small number of large experts in previous MoE research, this work investigates the under-explored case of numerous tiny experts.

• Learned Index Structure for Routing: Demonstrating for the first time that a learned index structure (Kraska et al., 2018) can efficiently route to over a million experts.

• New Layer Design: Combining product key routing with single-neuron experts, we introduce the PEER layer that expands layer capacity without significant computational overhead (see the sketch after this list). Empirical results demonstrate its superior efficiency compared to dense FFW, coarse-grained MoE and Product Key Memory (PKM) layers.

• Comprehensive Ablation Studies: We investigate the impact of different design choices of PEER such as number of experts, active parameters, number of heads and query batch normalization on language modeling tasks.
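
Putting the pieces together, here is a hedged single-token, single-head sketch of a PEER-style layer: product-key routing selects k expert indices, and each expert is a single neuron (one down-projection row and one up-projection row) whose outputs are mixed with softmax router weights. It reuses product_key_topk from the snippet above; the class name PEERSketch, the GELU activation and the default sizes are assumptions for illustration, not the paper's exact configuration.

```python
# Single-token sketch of a PEER-style layer: product-key routing over a large pool
# of single-neuron experts. Reuses product_key_topk from the previous snippet.
# A real implementation would batch over tokens and use multiple retrieval heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEERSketch(nn.Module):
    def __init__(self, d_model=256, n_sub_keys=256, k=16):
        # n_sub_keys = 1024 would give 1024^2 ~ 1M experts, as in the paper's extreme setting.
        super().__init__()
        self.k = k
        n_experts = n_sub_keys ** 2
        self.query_proj = nn.Linear(d_model, d_model)
        self.sub_keys_1 = nn.Parameter(torch.randn(n_sub_keys, d_model // 2))
        self.sub_keys_2 = nn.Parameter(torch.randn(n_sub_keys, d_model // 2))
        # Each expert is a single neuron: one down-projection row and one up-projection row.
        self.down = nn.Embedding(n_experts, d_model)
        self.up = nn.Embedding(n_experts, d_model)

    def forward(self, x):                                  # x: (d_model,)
        q = self.query_proj(x)
        ids, scores = product_key_topk(q, self.sub_keys_1, self.sub_keys_2, self.k)
        gate = F.softmax(scores, dim=-1)                   # (k,) router weights over retrieved experts
        hidden = F.gelu(self.down(ids) @ x)                # (k,) one activation per retrieved expert
        return (gate * hidden) @ self.up(ids)              # weighted sum of up-projections -> (d_model,)

layer = PEERSketch()
y = layer(torch.randn(256))                                # (256,) output for one token
```

Only k tiny experts are materialized per token, so compute stays roughly constant while the total parameter count grows with the size of the expert pool.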

--

Conclusion: This work introduces a fine-grained MoE architecture that decomposes an extremely wide dense feedforward layer into a large number of small experts. This design is supported by the recent discovery of the fine-grained MoE scaling law. To overcome the computational overhead of routing to a large number of experts, we apply the product key technique to efficiently select a small subset of hidden neurons within a wide MLP layer. Empirical analysis on language modeling tasks demonstrates that, given the same compute budget, PEER significantly outperforms dense transformers, coarse-grained MoEs and product key memory layers.

https://dev.to/mikeyoung44/mixture-of-a-million-experts-11n5

According to this summary, the training process for PEER involves several key steps (a toy sketch follows the list):

Dataset Partitioning: The training data is divided into subsets, each of which is assigned to a specific expert model.

Expert Training: Each expert model is trained on its assigned subset of the data, becoming highly specialized in that domain.

Router Training: The router model is trained to select the appropriate expert models for a given input, based on the input’s features and the experts’ specializations.
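
For illustration only, here is a toy rendering of the three steps listed above, with synthetic "domains" and tiny linear models standing in for the experts and the router; none of this is the paper's actual training code, and the real PEER layer is trained inside a transformer rather than as separate classifiers.

```python
# Toy illustration of the staged recipe described above: partition the data,
# train one small expert per subset, then train a router to pick the right expert.
# Shapes, the synthetic domains and the linear models are hypothetical stand-ins.
import torch
import torch.nn as nn

d_in, n_classes, n_domains = 16, 4, 3

def train(model, data, epochs=3, lr=1e-2):
    """Generic supervised loop shared by expert and router training."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# 1. Dataset partitioning: one synthetic subset of (inputs, labels) per domain.
subsets = [[(torch.randn(8, d_in) + i, torch.randint(n_classes, (8,)))
            for _ in range(10)] for i in range(n_domains)]

# 2. Expert training: a tiny model per subset, specialized on its domain.
experts = [nn.Linear(d_in, n_classes) for _ in range(n_domains)]
for expert, subset in zip(experts, subsets):
    train(expert, subset)

# 3. Router training: map an input to the index of the expert that owns its domain.
router = nn.Linear(d_in, n_domains)
routing_data = [(x, torch.full((x.shape[0],), i)) for i, s in enumerate(subsets) for x, _ in s]
train(router, routing_data)

# Inference: route first, then run only the selected expert.
x = torch.randn(1, d_in) + 2
expert_id = router(x).argmax(dim=-1).item()
logits = experts[expert_id](x)
```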

The “Mixture of A Million Experts” paper presents a new (old) approach to building LLMs. Why old? Intents. Remember simple chatbots? Familiar, isn’t it? Millions of atomic intents can cover the whole deal. No magic.
