AI inference challenges and opportunities // hardware acceleration
The article examines why the high cost of running inference remains a barrier to the widespread adoption of generative AI. Training Large Language Models (LLMs) is already expensive, with estimates pointing to significant GPU spending by companies like Meta, and the memory and bandwidth demands of generative AI make scaling these models a major obstacle.
Current AI hardware choices, including CPUs, GPUs, and custom accelerators, run into limits imposed by the traditional Von Neumann architecture: the large amounts of RAM and memory bandwidth these models require make generative AI economically challenging. Even so, demand for inference is expected to keep growing, requiring ever more compute power.
The article then introduces the concept of the “memory wall”: in traditional architectures, moving data between storage and processing consumes far more energy than the computation itself. In-memory computing (IMC) is proposed as a promising alternative, performing multiply-accumulate (MAC) operations near or inside the memory cells so that weights no longer have to be shuttled back and forth.
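To make the idea concrete, here is a minimal, purely conceptual sketch of the difference between the two dataflows. The class and function names are hypothetical illustrations, not d-Matrix's architecture, and the code only models where the weights live, not the hardware's energy behavior.

```python
# Conceptual sketch only: contrasts a conventional (Von Neumann) dataflow,
# where weights are fetched for every matrix-vector product, with a
# weight-stationary, in-memory-compute style where weights stay resident.
import numpy as np

def von_neumann_mac(weights, activations):
    """Conventional flow: weights are streamed from memory to the compute
    unit on every call, so data movement dominates the energy cost."""
    fetched = weights.copy()          # models the per-call weight transfer
    return fetched @ activations      # the multiply-accumulate (MAC) itself

class InMemoryComputeTile:
    """Weight-stationary flow: weights are written into the memory/compute
    array once; afterwards only activations and results move."""
    def __init__(self, weights):
        self.resident_weights = weights   # held in (or next to) memory cells

    def mac(self, activations):
        return self.resident_weights @ activations  # MAC performed in place

# Example: one projection layer with arbitrary shapes
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
x = rng.standard_normal(1024).astype(np.float32)

tile = InMemoryComputeTile(W)
assert np.allclose(von_neumann_mac(W, x), tile.mac(x))
```

Both paths compute the same result; the point of IMC is that the weight-stationary path avoids repeating the expensive weight transfer for every token generated.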
Digital In-Memory Computing (DIMC) is presented as an alternative to analog IMC, offering noise-free computation and greater flexibility. By cutting unnecessary data movement, DIMC has the potential to lower the cost and improve the performance of AI inference, and with it the economics of the technology. The article concludes by introducing d-Matrix, a startup building AI chips for generative AI inference based on DIMC technology.
The last few weeks have brought noteworthy developments, with new models like OpenAI’s Sora and Google’s Gemma open-source family debuting and becoming more capable. These models are opening up exciting possibilities for new applications. However, the cost of inference and escalating demands for memory and bandwidth continue to pose a very real barrier to the widespread adoption of generative AI.
We set out to solve this at d-Matrix so enterprises can take full advantage of fast-evolving model technology with cost-efficient solutions purpose-built for inference. Our combination of 2048 digital in-memory compute (DIMC) cores, 16 GB of SRAM, and >1 TB of LPDDR in a single-node, 8-card AI server enables us to deliver industry-leading throughput and low-latency inference for a wide range of models. With the Llama 70B model, for instance, we expect to generate 350 tokens per second using 48 Corsair cards, vastly improving TCO over incumbent solutions and advancing our mission of commercially viable generative AI for all.
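As a rough back-of-the-envelope, the figures above can be combined as follows. This reads the 2048 DIMC cores, 16 GB SRAM, and >1 TB LPDDR as per-server totals and simply divides the quoted throughput across the 48 cards; actual deployment configurations and scaling behavior may differ.

```python
# Back-of-the-envelope arithmetic using only the figures quoted above.
cards_per_server = 8
dimc_cores_per_server = 2048
sram_gb_per_server = 16
lpddr_tb_per_server = 1          # quoted as ">1 TB", treated as a lower bound

total_cards = 48                 # Llama 70B example
tokens_per_second = 350

servers = total_cards / cards_per_server            # 6 servers
tokens_per_card = tokens_per_second / total_cards   # ~7.3 tokens/s per card

print(f"servers: {servers:.0f}")
print(f"aggregate DIMC cores: {servers * dimc_cores_per_server:.0f}")
print(f"aggregate SRAM: {servers * sram_gb_per_server:.0f} GB")
print(f"aggregate LPDDR: >{servers * lpddr_tb_per_server:.0f} TB")
print(f"throughput per card: {tokens_per_card:.1f} tokens/s")
```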
In insideHPC, our VP of Product Sree Ganesan discusses the need for more efficient compute options and the potential of IMC technology to solve the challenges of generative AI.