A guide to open-source LLM inference and performance
Determining whether inference is compute bound or memory bound takes three steps:

- Calculating the operations-to-byte (ops:byte) ratio of your GPU.
- Calculating the arithmetic intensity of your LLM.
- Comparing the ops:byte ratio to the arithmetic intensity to discover whether inference is compute bound or memory bound.

GPU memory bandwidth: we can move 600 GB/s from GPU memory to our on-chip processing units. If we find ourselves only able to complete fewer than 208.3 operations per byte moved, our system performance is memory bound. In this state, performance is constrained not by the number of compute units our chip possesses, but by how quickly we can move data through memory. All three steps follow the same pattern: load values from memory, perform a computation, and store the results of that computation back to memory.

When generating a single token on each GPU, recall that during the autoregressive part of generation we are memory bandwidth bound if our batch size is 1. We want to make the most of our compute capacity during LLM inference, but we can't do that while we're memory bound.
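The comparison above can be sketched in a few lines. A minimal sketch, assuming the 600 GB/s memory bandwidth stated in the text and a compute bandwidth of 125 TFLOPS (an assumption — it is the figure implied by 208.3 ops/byte × 600 GB/s); the function names here are illustrative, not from any library:

```python
# Deciding whether inference is compute bound or memory bound by
# comparing the GPU's ops:byte ratio to the model's arithmetic intensity.

COMPUTE_BANDWIDTH_FLOPS = 125e12  # assumed: 125 TFLOPS, implied by the 208.3 figure
MEMORY_BANDWIDTH_BYTES = 600e9    # 600 GB/s, as stated in the text


def ops_to_byte_ratio(compute_flops: float, memory_bytes_per_s: float) -> float:
    """Operations the GPU can complete per byte it moves from memory."""
    return compute_flops / memory_bytes_per_s


def is_memory_bound(arithmetic_intensity: float, ops_byte: float) -> bool:
    """Memory bound when the model performs fewer ops per byte than the GPU could."""
    return arithmetic_intensity < ops_byte


ratio = ops_to_byte_ratio(COMPUTE_BANDWIDTH_FLOPS, MEMORY_BANDWIDTH_BYTES)
print(f"ops:byte ratio = {ratio:.1f}")  # → ops:byte ratio = 208.3

# At batch size 1, autoregressive decoding does roughly one multiply-add
# (~2 ops) per FP16 parameter (2 bytes) loaded, so arithmetic intensity is
# on the order of 1 op/byte — far below 208.3, hence memory bound.
print(is_memory_bound(1.0, ratio))  # → True
```

Larger batch sizes reuse each loaded weight for more tokens, raising arithmetic intensity and pushing the workload toward the compute-bound regime.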