LLM on FPGA // why? cybersecurity

FPGAs are the most advanced hardware cybersecurity technology, and LLMs are the new processors of the future.

sbagency
8 min read · Sep 10, 2024
https://arxiv.org/pdf/2401.03868

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and quantization are commonly used to mitigate the gap between LLMs' computation/memory overheads and hardware capacity. However, existing GPU and transformer-based accelerators cannot efficiently process compressed LLMs, due to the following unresolved challenges: low computational efficiency, underutilized memory bandwidth, and large compilation overheads. This paper proposes FlightLLM, enabling efficient LLM inference with a complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative solution: the computation and memory overhead of LLMs can be addressed by utilizing FPGA-specific resources (e.g., DSP48 and the heterogeneous memory hierarchy). First, we propose a configurable sparse DSP chain to support different sparsity patterns with high computation efficiency. Second, we propose an always-on-chip decode scheme to boost memory bandwidth with mixed-precision support. Finally, to make FlightLLM available for real-world LLMs, we propose a length adaptive compilation method to reduce the compilation overhead. Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves 6.0× higher energy efficiency and 1.8× better cost efficiency against commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using vLLM and SmoothQuant at a batch size of one. FlightLLM beats the NVIDIA A100 GPU with 1.2× higher throughput using the latest Versal VHK158 FPGA.

To address these issues, the paper introduces FlightLLM, a system for efficient LLM inference on FPGAs. FlightLLM leverages FPGA-specific resources to optimize computation and memory usage, tackling low computation efficiency, underutilized memory bandwidth, and large compilation overheads. Its key innovations are a configurable sparse DSP chain for higher computation efficiency, an always-on-chip decode scheme to boost effective memory bandwidth, and a length-adaptive compilation method to reduce compilation overhead.

Tested on various models, FlightLLM shows significant improvements in latency, energy efficiency, and cost efficiency compared to NVIDIA GPUs.

We design a high-performance FPGA-based accelerator for generative LLMs by making full use of FPGA resources. Combined with compression techniques like sparsification and quantization, FlightLLM can effectively accelerate generative LLMs and reduce inference overhead. The overall hardware architecture of FlightLLM mainly includes a task scheduler, a memory controller, and multiple computing cores (cores for short). The accelerator uses model parallelism across multiple cores to complete the LLM inference task, with the task scheduler assigning tasks to different cores and controlling data synchronization.

Each core contains a unified Matrix Processing Engine (MPE), a Memory Management Unit (MMU), a Special Function Unit (SFU), and an instruction scheduler. The instruction scheduler decodes the input instructions and schedules the different hardware units to perform computations. The remaining units work as follows: the MPE handles all matrix operations (dense and sparse) in LLMs and uses the configurable sparse DSP chain to reduce hardware overhead on the FPGA; the MMU reduces memory access overheads with customized quantization units for low-bit mixed precision and optimized data placement in off-chip memory; the SFU handles miscellaneous operations (e.g., Softmax) besides matrix processing and also provides an additional data path to share data with the SFUs in other cores, accelerating the matrix-vector (MV) operation.
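
To make the division of labor concrete, here is a minimal structural sketch of that core layout: a task scheduler dispatching work across cores, each with an instruction scheduler routing operations to an MPE, MMU, or SFU. All type names and the round-robin dispatch are illustrative assumptions, not code from the paper; the real units (configurable sparse DSP chains, mixed-precision decode, cross-core SFU data paths) are reduced to stubs.

```cpp
// Minimal structural sketch of the FlightLLM-style core layout described above.
// All names and behaviors are illustrative stand-ins for the paper's hardware units.
#include <cstdio>
#include <vector>

struct Instruction {
    enum class Op { MatMul, Softmax, Sync } op;
    int size;
};

struct MatrixProcessingEngine {      // MPE: dense and sparse matrix operations
    void run(const Instruction& i) { std::printf("MPE: matmul, size %d\n", i.size); }
};

struct MemoryManagementUnit {        // MMU: dequantization + off-chip data placement
    void prefetch(int bytes) { std::printf("MMU: prefetch %d bytes\n", bytes); }
};

struct SpecialFunctionUnit {         // SFU: Softmax and other non-matrix operations
    void run(const Instruction& i) { std::printf("SFU: softmax over %d elements\n", i.size); }
};

struct Core {
    MatrixProcessingEngine mpe;
    MemoryManagementUnit mmu;
    SpecialFunctionUnit sfu;

    // Instruction scheduler: decode each instruction and route it to a unit.
    void execute(const Instruction& i) {
        switch (i.op) {
            case Instruction::Op::MatMul:  mmu.prefetch(i.size * i.size); mpe.run(i); break;
            case Instruction::Op::Softmax: sfu.run(i); break;
            case Instruction::Op::Sync:    std::printf("core: barrier\n"); break;
        }
    }
};

// Task scheduler: assigns work to cores (model parallelism) and synchronizes them.
struct TaskScheduler {
    std::vector<Core>& cores;
    void dispatch(const std::vector<Instruction>& program) {
        for (size_t k = 0; k < program.size(); ++k)
            cores[k % cores.size()].execute(program[k]);
        for (Core& c : cores) c.execute({Instruction::Op::Sync, 0});
    }
};

int main() {
    std::vector<Core> cores(2);
    TaskScheduler scheduler{cores};
    scheduler.dispatch({{Instruction::Op::MatMul, 128}, {Instruction::Op::Softmax, 128}});
}
```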

https://arxiv.org/pdf/2409.03384

Large Language Models (LLMs) have emerged as powerful tools for natural language processing tasks, revolutionizing the field with their ability to understand and generate human-like text. In this paper, we present a comprehensive survey of the research efforts that have been presented for the acceleration of transformer networks for Large Language Models using hardware accelerators. The survey presents the proposed frameworks and then performs a qualitative and quantitative comparison regarding the technology, the processing platform (FPGA, ASIC, In-Memory, GPU), the speedup, the performance (GOPs), and the energy efficiency (GOPs/W) of each framework. The main challenge in this comparison is that every proposed scheme is implemented on a different process technology, which makes a fair comparison hard. The main contribution of this paper is that we extrapolate the performance and energy efficiency results to the same technology to make two fair comparisons: one theoretical and one more practical. We implement part of the LLMs on several FPGA chips to extrapolate the results to the same process technology and then make a fair comparison of the performance.
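
As a rough illustration of what "extrapolating to the same technology" involves, the sketch below rescales reported performance and energy-efficiency numbers using generic first-order node-scaling assumptions (performance ~ 1/node, energy per operation ~ node^2). These scaling rules and the example numbers are assumptions for illustration only; the survey's own extrapolation methodology and coefficients are in the paper.

```cpp
// First-order sketch of normalizing reported accelerator results to a common
// process node, in the spirit of the survey's "same technology" comparison.
// The scaling rules below are generic assumptions, not the survey's methodology.
#include <cstdio>

struct Result {
    const char* name;
    double node_nm;        // process technology the design was reported on
    double gops;           // reported performance
    double gops_per_watt;  // reported energy efficiency
};

Result normalize_to(const Result& r, double target_nm) {
    double s = r.node_nm / target_nm;                  // >1 when moving to a smaller node
    return {r.name, target_nm, r.gops * s, r.gops_per_watt * s * s};
}

int main() {
    Result reported[] = {
        {"accel-A", 28.0, 500.0, 20.0},                // illustrative numbers only
        {"accel-B", 16.0, 900.0, 45.0},
    };
    for (const Result& r : reported) {
        Result n = normalize_to(r, 16.0);              // project everything to 16 nm
        std::printf("%s: %.0f GOPs, %.1f GOPs/W at %g nm\n",
                    n.name, n.gops, n.gops_per_watt, n.node_nm);
    }
}
```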

https://arxiv.org/pdf/2406.02528

Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers that require far more memory during inference, at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full-precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10× compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter-scale models at 13 W beyond human-readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
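
The core trick can be illustrated in a few lines: when weights are constrained to {-1, 0, +1}, a dense layer's multiply-accumulate loop degenerates into additions and subtractions. The sketch below is a plain scalar illustration of that arithmetic, not the repository's fused GPU/FPGA kernels; the function name and toy values are assumptions.

```cpp
// Minimal sketch of a MatMul-free dense layer: with ternary weights in {-1, 0, +1},
// the usual multiply-accumulate collapses into additions and subtractions.
#include <cstdint>
#include <cstdio>
#include <vector>

// y[j] = sum_i W[j][i] * x[i], where W[j][i] is restricted to {-1, 0, +1}
std::vector<float> ternary_linear(const std::vector<std::vector<int8_t>>& W,
                                  const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.0f);
    for (size_t j = 0; j < W.size(); ++j) {
        for (size_t i = 0; i < x.size(); ++i) {
            if (W[j][i] == 1)       y[j] += x[i];   // add instead of multiply
            else if (W[j][i] == -1) y[j] -= x[i];   // subtract instead of multiply
            // a zero weight contributes nothing
        }
    }
    return y;
}

int main() {
    std::vector<std::vector<int8_t>> W = {{1, -1, 0}, {0, 1, 1}};
    std::vector<float> x = {0.5f, 2.0f, -1.0f};
    std::vector<float> y = ternary_linear(W, x);
    std::printf("y = [%.1f, %.1f]\n", y[0], y[1]);  // [-1.5, 1.0]
}
```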

https://github.com/ridgerchu/matmulfreellm
https://arxiv.org/pdf/2406.07177

Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks, but they are hindered by high computational costs and memory requirements. Ternarization, an extreme form of quantization, offers a solution by reducing memory usage and enabling energy-efficient floating-point additions. However, applying ternarization to LLMs faces challenges stemming from outliers in both weights and activations. In this work, observing asymmetric outliers and non-zero means in weights, we introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable. We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization. The proposed OFF can incorporate semantic information and is insensitive to outliers. At the core of OFF is maximizing the mutual information between features in ternarized and floating-point models using cosine similarity. Extensive experiments demonstrate that our TernaryLLM surpasses previous low-bit quantization methods on the standard text generation and zero-shot benchmarks for different LLM families. Specifically, for one of the most powerful open-source models, LLaMA-3, our approach (W1.58A16) outperforms the previous state-of-the-art method (W2A16) by 5.8 in terms of perplexity on C4 and by 8.2% in terms of average accuracy on zero-shot tasks.
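
To show roughly what ternarization with a scale and a shift looks like in the forward pass, here is a hedged sketch: weights are centered by a shift, mapped to {-1, 0, +1} with a threshold, and reconstructed as scale * t + shift. In TernaryLLM the scale and shift are learnable and the exact formulation is the paper's; the fixed values, threshold rule, and function name below are illustrative assumptions.

```cpp
// Hedged sketch of per-channel ternarization with a scale and shift, in the spirit
// of the Dual Learnable Ternarization described above. Only the forward
// quantize/dequantize step is shown; training of scale/shift is omitted.
#include <cmath>
#include <cstdio>
#include <vector>

// Ternarize one channel: remove the shift, map to {-1, 0, +1} with a threshold,
// then reconstruct an approximate weight as scale * t + shift.
std::vector<float> ternarize_channel(const std::vector<float>& w,
                                     float scale, float shift, float threshold) {
    std::vector<float> w_hat(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        float centered = w[i] - shift;
        int t = (std::fabs(centered) <= threshold) ? 0 : (centered > 0.0f ? 1 : -1);
        w_hat[i] = scale * static_cast<float>(t) + shift;   // dequantized approximation
    }
    return w_hat;
}

int main() {
    std::vector<float> w = {0.9f, 0.1f, -0.6f, 0.15f};
    std::vector<float> w_hat = ternarize_channel(w, /*scale=*/0.7f, /*shift=*/0.1f,
                                                 /*threshold=*/0.3f);
    for (float v : w_hat) std::printf("%.2f ", v);          // 0.80 0.10 -0.60 0.10
    std::printf("\n");
}
```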

https://drive.google.com/file/d/14953tADjNV2_mle0b9VxxzuElDf8djrT/view
https://www.achronix.com/blog/accelerating-llm-inferencing-fpgas

Reference C++ implementations

https://github.com/ggerganov/llama.cpp

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

Plain C/C++ implementation without any dependencies

Apple silicon is a first-class citizen — optimized via ARM NEON, Accelerate and Metal frameworks

AVX, AVX2 and AVX512 support for x86 architectures

1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use (a simplified sketch of block-wise quantization follows this feature list)

Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP)

Vulkan and SYCL backend support

CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
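
As promised above, here is a simplified sketch of block-wise 4-bit quantization, the basic pattern behind the integer-quantization modes in the feature list: weights are grouped into fixed-size blocks, and each block stores one scale plus low-bit integers. llama.cpp's actual formats (Q4_0, Q4_K, and so on) use specific bit packing and per-block fp16 scales; this toy version only illustrates the idea.

```cpp
// Simplified sketch of block-wise 4-bit weight quantization. Not a drop-in
// reimplementation of llama.cpp's formats; illustration only.
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kBlockSize = 32;         // one scale shared by 32 consecutive weights

struct Block4 {
    float scale;                       // per-block scale factor
    std::array<int8_t, kBlockSize> q;  // values in [-8, 7], packable into 4 bits each
};

Block4 quantize_block(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) amax = std::max(amax, std::fabs(x[i]));
    Block4 b{};
    b.scale = amax / 7.0f;             // map the largest magnitude onto +/-7
    for (int i = 0; i < kBlockSize; ++i) {
        int q = (b.scale > 0.0f) ? static_cast<int>(std::lround(x[i] / b.scale)) : 0;
        b.q[i] = static_cast<int8_t>(std::min(7, std::max(-8, q)));
    }
    return b;
}

void dequantize_block(const Block4& b, float* out) {
    for (int i = 0; i < kBlockSize; ++i) out[i] = b.scale * b.q[i];
}

int main() {
    std::vector<float> w(kBlockSize), w_hat(kBlockSize);
    for (int i = 0; i < kBlockSize; ++i) w[i] = 0.01f * (i - 16);   // toy weights
    Block4 b = quantize_block(w.data());
    dequantize_block(b, w_hat.data());
    std::printf("w[0] = %.3f, reconstructed = %.3f, scale = %.4f\n", w[0], w_hat[0], b.scale);
}
```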

https://github.com/DeepWok/mase

The development of machine learning (ML) accelerators is being outpaced by the rapid evolution of ML models, making many accelerators quickly obsolete. While existing design tools can prototype accelerators, they are limited to models that fit on a single device. With the rise of large models like GPT-3, there is a growing need for prototyping hardware systems that use many accelerators. MASE addresses this by providing a scalable solution for mapping large ML models onto an efficient streaming accelerator system. It achieves better energy efficiency than GPUs for inference on recent transformer models.
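
MASE's mapping flow is considerably richer (streaming hardware generation, per-operator tuning, and so on), but the basic problem it targets, placing a model too large for one device across several accelerators, can be illustrated with a toy greedy partitioner. The names, memory budgets, and first-fit strategy below are assumptions for illustration, not MASE's algorithm.

```cpp
// Toy illustration of the multi-accelerator mapping problem mentioned above:
// assign consecutive layers to devices so each device stays within its memory budget.
#include <cstdio>
#include <vector>

struct Layer  { const char* name; double param_mb; };
struct Device { const char* name; double mem_mb; double used_mb; };

// Place layers in pipeline order, moving to the next device when one fills up.
std::vector<int> assign(const std::vector<Layer>& layers, std::vector<Device>& devices) {
    std::vector<int> placement(layers.size(), -1);
    size_t d = 0;
    for (size_t l = 0; l < layers.size(); ++l) {
        while (d < devices.size() &&
               devices[d].used_mb + layers[l].param_mb > devices[d].mem_mb) ++d;
        if (d == devices.size()) break;            // model does not fit on this system
        devices[d].used_mb += layers[l].param_mb;
        placement[l] = static_cast<int>(d);
    }
    return placement;
}

int main() {
    std::vector<Layer> layers   = {{"embed", 400}, {"block0", 600}, {"block1", 600}, {"head", 400}};
    std::vector<Device> devices = {{"fpga0", 1100, 0}, {"fpga1", 1100, 0}};
    std::vector<int> placement = assign(layers, devices);
    for (size_t l = 0; l < layers.size(); ++l)
        std::printf("%s -> %s\n", layers[l].name,
                    placement[l] >= 0 ? devices[placement[l]].name : "(unplaced)");
}
```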

Alternative implementations

https://www.youtube.com/watch?v=ksgLoPxEQzM
https://sambanova.ai/technology/sn40l-rdu-ai-chip

