LLMs from scratch // C/CUDA
High-level libraries such as PyTorch are great // but they come with trade-offs, as usual
A high-level PyTorch LLM implementation can be ported to C/CUDA for better hardware support and easier porting to new platforms. There is another project, llama.cpp // but it is for inference only
Here’s a summary of the key points from Andrej Karpathy’s talk about llm.c:
1. Origin: llm.c started as a project to simplify and understand GPT training without the complexities of PyTorch abstractions.
2. Development process:
— Started by porting PyTorch code to C, layer by layer
— Implemented forward and backward passes manually
— Converted C code to CUDA for GPU acceleration
— Optimized kernels and implemented various performance improvements
3. Key features of llm.c:
— Single file of C code with no dependencies
— Pre-planned memory allocation
— Fully deterministic
— Can run on minimal hardware
4. Collaborative effort:
— Started by Karpathy, but quickly attracted contributors from the internet
— Over 60 contributors, with core developers like Erik, Arun, and Aleksa
5. Optimizations:
— Mixed precision (FP32, BF16)
— Kernel fusions
— Recompute settings
— Memory optimizations
— Multi-GPU and multi-node support
6. Current capabilities:
— Can train GPT-2 (1.6B parameters) on a single node of H100s in 24 hours
— About 30% less memory usage and about 20% faster training than the PyTorch implementation (at the time of the talk)
7. Ongoing work:
— Adding Llama 3 support
— Implementing FP8 support
— Various forks for different architectures and languages
8. Future implications:
— llm.c demonstrates the possibility of highly optimized, custom implementations
— As language models improve at coding, they could potentially generate optimized code like llm.c for specific applications
— llm.c could serve as example code for language models to learn from when generating custom implementations
The talk highlights the potential for AI-driven software development and the importance of understanding low-level implementations in the era of large language models.
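To make the "port layer by layer, implement forward and backward passes manually" step from the summary more concrete, here is a minimal sketch in plain C of a linear layer together with a finite-difference gradient check. It is an illustration only and not code taken from llm.c: the function names (`linear_forward`, `linear_backward`, `linear_grad_check`), the shapes, and the test values are all assumptions made for this example.

```c
#include <stddef.h>

// Forward pass of a linear layer (hypothetical helper, not the llm.c API):
//   out[b][o] = bias[o] + sum_i inp[b][i] * weight[o][i]
// Shapes (row-major): inp (B, C), weight (OC, C), bias (OC), out (B, OC).
void linear_forward(float *out, const float *inp, const float *weight,
                    const float *bias, int B, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int o = 0; o < OC; o++) {
            float acc = (bias != NULL) ? bias[o] : 0.0f;
            for (int i = 0; i < C; i++) {
                acc += inp[b * C + i] * weight[o * C + i];
            }
            out[b * OC + o] = acc;
        }
    }
}

// Backward pass: given dout = dLoss/dout, accumulate (+=) into the gradients.
// The caller is expected to zero dinp, dweight and dbias beforehand.
void linear_backward(float *dinp, float *dweight, float *dbias,
                     const float *dout, const float *inp, const float *weight,
                     int B, int C, int OC) {
    for (int b = 0; b < B; b++) {
        for (int o = 0; o < OC; o++) {
            float d = dout[b * OC + o];
            if (dbias != NULL) dbias[o] += d;
            for (int i = 0; i < C; i++) {
                dinp[b * C + i] += weight[o * C + i] * d;
                dweight[o * C + i] += inp[b * C + i] * d;
            }
        }
    }
}

// Fixed toy sizes used only by the gradient check below.
enum { CHK_B = 2, CHK_C = 3, CHK_OC = 2 };

// Scalar loss for the check: L = sum_k dout[k] * out[k].
static float check_loss(const float *inp, const float *weight,
                        const float *bias, const float *dout) {
    float out[CHK_B * CHK_OC];
    linear_forward(out, inp, weight, bias, CHK_B, CHK_C, CHK_OC);
    float loss = 0.0f;
    for (int k = 0; k < CHK_B * CHK_OC; k++) loss += dout[k] * out[k];
    return loss;
}

// Returns the largest absolute gap between the analytic weight gradient
// and a central finite-difference estimate.
float linear_grad_check(void) {
    float inp[CHK_B * CHK_C] = {0.5f, -1.0f, 2.0f, 1.5f, 0.25f, -0.75f};
    float weight[CHK_OC * CHK_C] = {0.1f, 0.2f, -0.3f, 0.4f, -0.5f, 0.6f};
    float bias[CHK_OC] = {0.05f, -0.05f};
    float dout[CHK_B * CHK_OC] = {1.0f, -2.0f, 0.5f, 3.0f};
    float dinp[CHK_B * CHK_C] = {0};
    float dweight[CHK_OC * CHK_C] = {0};
    float dbias[CHK_OC] = {0};
    linear_backward(dinp, dweight, dbias, dout, inp, weight,
                    CHK_B, CHK_C, CHK_OC);

    const float eps = 1e-2f;
    float max_gap = 0.0f;
    for (int k = 0; k < CHK_OC * CHK_C; k++) {
        float saved = weight[k];
        weight[k] = saved + eps;
        float loss_plus = check_loss(inp, weight, bias, dout);
        weight[k] = saved - eps;
        float loss_minus = check_loss(inp, weight, bias, dout);
        weight[k] = saved;
        float numeric = (loss_plus - loss_minus) / (2.0f * eps);
        float gap = numeric - dweight[k];
        if (gap < 0.0f) gap = -gap;
        if (gap > max_gap) max_gap = gap;
    }
    return max_gap;
}
```

Calling `linear_grad_check()` should return a value close to zero; a large gap usually points to a bug in the hand-written backward pass. Stringing several such layers together then amounts to calling the forward functions in order and the backward functions in reverse order.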
llm.c is a great project and can be ported to different hardware platforms.
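The "pre-planned memory allocation" feature listed in the summary can be sketched roughly as follows: compute the size of every tensor up front, allocate one block, and point each tensor into it. This is a hedged illustration of the idea, not llm.c's actual code; the struct `TinyParams`, its fields, and the functions `plan_sizes` and `alloc_and_point` are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_TENSORS 3

// Hypothetical parameter set for a tiny model; names are illustrative only.
typedef struct {
    float *wte;  // token embeddings, shape (V, C)
    float *fcw;  // linear weight,    shape (C, C)
    float *fcb;  // linear bias,      shape (C)
} TinyParams;

// Step 1: plan. Compute the element count of every tensor before allocating.
void plan_sizes(size_t sizes[NUM_TENSORS], size_t V, size_t C) {
    sizes[0] = V * C;
    sizes[1] = C * C;
    sizes[2] = C;
}

// Step 2: allocate a single zero-initialized block and point each tensor
// into it. Returns the base pointer (free it once at shutdown) or NULL.
float *alloc_and_point(TinyParams *p, const size_t sizes[NUM_TENSORS]) {
    size_t total = 0;
    for (int i = 0; i < NUM_TENSORS; i++) total += sizes[i];

    float *base = (float *)calloc(total, sizeof(float));
    if (base == NULL) return NULL;

    // Order here must match the order used in plan_sizes.
    float **slots[NUM_TENSORS] = {&p->wte, &p->fcw, &p->fcb};
    float *cursor = base;
    for (int i = 0; i < NUM_TENSORS; i++) {
        *slots[i] = cursor;
        cursor += sizes[i];
    }
    return base;
}
```

With a layout like this there is exactly one allocation and one `free`, no allocation happens inside the training loop, and the whole parameter set can be written to or read from disk as one contiguous buffer.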
Karpathy’s own outline of the talk: I gave a talk at the GPU MODE workshop last week on llm.c
- the origin story of llm.c
- being naked in the world without PyTorch and having to re-invent Array, Autograd, Device, Dtype, Compile, Distributed
- how to port a PyTorch layer to 1) explicit PyTorch
- and then to 2) write the backward pass
- 3) port forward & backward pass to C
- 4) string all the layers together
- achieving one file of C with no dependencies that compiles and runs ~instantly, where all memory is pre-planned and allocated a single time, fully deterministic, portable code that can run on a potato or a von Neumann probe
- how most of llm.c was built at 1am-7am in a water villa porch in Maldives and why this is the recommended way to develop software
- convert all of it to run in CUDA on GPU in fp32
- port matmul to cuBLAS
- port attention to cuDNN flash-attention
- introduce bfloat16 mixed precision
- introduce many more optimizations and features like kernel fusions, Packed128, stochastic rounding, full determinism
- add multi-GPU training, NCCL, sharded optimizer
- add multi-node with MPI or file system or socket
- reproduce GPT-2 (1.6B) on one 8xH100 node in 24 hours for $672 in llm.c, achieving (at the time) 29% less memory, 19% faster training than PyTorch nightly, and much faster compile & run
- how open source development attracts Avengers from the internet
- port to training Llama 3 imminent (branch exists)
- many other notable forks
- last thought: how software abstractions like Python/PyTorch and everything else really exist only because humans are finite in knowledge, IQ and attention, and how with increasing AI capability LLMs may export custom binaries like llm.c for any application directly, tearing apart and refactoring all abstractions as needed.
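Of the optimizations in the list above, stochastic rounding is easy to show in isolation. The sketch below, written in plain C rather than CUDA, rounds an fp32 value to bfloat16 by adding random noise to the 16 bits that are about to be discarded and then truncating. It assumes a simple xorshift generator and does not handle NaN or infinity; the function names are made up for this example and are not the llm.c kernels.

```c
#include <stdint.h>
#include <string.h>

// Small deterministic PRNG (xorshift32): the same seed gives the same
// sequence, which keeps results reproducible. State must be nonzero.
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

// bfloat16 is the upper 16 bits of the fp32 bit pattern. Adding a random
// 16-bit value to the low half before truncating rounds the magnitude up
// with probability equal to the discarded fraction.
// Sketch only: NaN, infinity and overflow near FLT_MAX are not handled.
uint16_t float_to_bf16_stochastic(float x, uint32_t *rng_state) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint32_t noise = xorshift32(rng_state) & 0xFFFFu;
    bits += noise;
    return (uint16_t)(bits >> 16);
}

// Widen bfloat16 back to fp32 by restoring zeros in the low 16 bits.
float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

// Average of n stochastic roundings of x; it should approach x as n grows.
float mean_of_roundings(float x, int n, uint32_t seed) {
    uint32_t state = seed ? seed : 1u;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += bf16_to_float(float_to_bf16_stochastic(x, &state));
    }
    return (float)(sum / n);
}
```

The point of the technique is that small updates survive on average. Near 1.0 the spacing between bfloat16 values is 2^-7, so round-to-nearest would always turn 1.001 back into 1.0, whereas `mean_of_roundings(1.001f, 100000, 42u)` should land close to 1.001 because a fraction of the samples round up to 1.0078125.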