Inference optimization // swap model parameters to/from drive (flash)

Dec 31, 2023

The idea is simple: compute the model's layers sequentially, loading only one layer into memory at a time. This is slower, but it allows large models to run on low-memory devices.

https://github.com/lyogavin/Anima/tree/main/air_llm

https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb

https://medium.com/ai-advances/how-your-ordinary-8gb-macbooks-untapped-ai-power-can-run-70b-llm-models-that-will-blow-your-mind-134aa62edb22
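A minimal sketch of the layer-by-layer idea, using a toy MLP and NumPy rather than a real LLM. It is not the AirLLM implementation; the helper names `save_layers` and `forward_layer_by_layer` and the one-file-per-layer layout are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np


def save_layers(weights, directory):
    """Write each layer's weight matrix to its own .npy file; return the paths."""
    paths = []
    for i, w in enumerate(weights):
        path = os.path.join(directory, f"layer_{i}.npy")
        np.save(path, w)
        paths.append(path)
    return paths


def forward_layer_by_layer(x, layer_paths):
    """Run a toy MLP while holding only one layer's weights in memory at a time."""
    for path in layer_paths:
        w = np.load(path)            # read this layer's weights from disk
        x = np.maximum(x @ w, 0.0)   # linear layer followed by ReLU
        del w                        # drop the reference before loading the next layer
    return x


rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 8)).astype(np.float32) for _ in range(4)]
x = rng.standard_normal((2, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    paths = save_layers(weights, tmp)
    streamed = forward_layer_by_layer(x, paths)

# Reference: the same computation with every layer resident in memory.
resident = x
for w in weights:
    resident = np.maximum(resident @ w, 0.0)
```

The streamed result matches the fully resident one; the trade-off is one disk read per layer per forward pass, which is where the slowdown comes from.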

Same idea: swap model parameters // LLM in a flash

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their intensive computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters on flash memory but bringing them on demand to DRAM. Our method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this flash memory-informed framework, we introduce two principal techniques. First, “windowing” strategically reduces data transfer by reusing previously activated neurons, and second, “row-column bundling”, tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with a 4–5x and 20–25x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.
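The two techniques named in the abstract can be sketched roughly as follows. This is a simplified illustration, not the paper's code: the function names, the toy activation trace, and the assumed weight shapes (`w_up` as `(d_model, d_ff)`, `w_down` as `(d_ff, d_model)`) are all hypothetical.

```python
from collections import deque

import numpy as np


def windowed_loads(active_per_token, window):
    """Windowing: per token, return the neurons that must be read from flash.

    Neurons active for any of the last `window` tokens stay cached in DRAM,
    so only the newly needed ones are transferred.
    """
    recent = deque()   # active-neuron sets for the last `window` tokens
    cached = set()     # neurons currently resident in DRAM
    loads = []
    for active in active_per_token:
        active = set(active)
        loads.append(active - cached)    # only the missing neurons are read
        recent.append(active)
        if len(recent) > window:
            recent.popleft()
        cached = set().union(*recent)    # evict anything outside the window
    return loads


def bundle_rows_columns(w_up, w_down):
    """Row-column bundling: store column i of w_up next to row i of w_down.

    Row i of the result holds everything neuron i needs, so it can be
    fetched with a single, larger contiguous read.
    """
    return np.concatenate([w_up.T, w_down], axis=1)   # shape (d_ff, 2 * d_model)


# Toy trace of active neuron ids for four consecutive tokens (made-up data).
trace = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {1, 5, 6}]
loads = windowed_loads(trace, window=2)
total_windowed = sum(len(s) for s in loads)   # neurons read with windowing
total_naive = sum(len(s) for s in trace)      # neurons read when reloading every token

w_up = np.arange(12, dtype=np.float32).reshape(3, 4)     # d_model=3, d_ff=4
w_down = np.arange(12, dtype=np.float32).reshape(4, 3)   # d_ff=4, d_model=3
bundled = bundle_rows_columns(w_up, w_down)
```

In this toy trace, windowing reads 7 neurons in total instead of 12, because consecutive tokens share most of their active neurons.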

SLM: TinyLlama-1.1B

https://github.com/jzhang38/TinyLlama
The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens. With some proper optimization, we can achieve this within a span of “just” 90 days using 16 A100-40G GPUs 🚀🚀. The training has started on 2023-09-01.
We adopted exactly the same architecture and tokenizer as Llama 2. This means TinyLlama can be plugged and played in many open-source projects built upon Llama. Besides, TinyLlama is compact with only 1.1B parameters. This compactness allows it to cater to a multitude of applications demanding a restricted computation and memory footprint.
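For a rough sense of what "compact" means here, the weight footprint can be estimated from the parameter count and the bits per parameter. This back-of-the-envelope sketch counts weights only; it ignores activations, the KV cache, and runtime overhead, and the helper name `weight_gb` is made up for illustration.

```python
def weight_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


tinyllama_fp16 = weight_gb(1.1e9, 16)   # about 2.2 GB
tinyllama_4bit = weight_gb(1.1e9, 4)    # about 0.55 GB
llama70b_fp16 = weight_gb(70e9, 16)     # about 140 GB
```

The gap between a 1.1B and a 70B model at the same precision is what makes the swapping techniques above necessary for the larger one on small devices.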


Written by sbagency
