2 Comments

Concerning hardware resources, is the LLM process MIPS or FLOPS intensive – the article implies both, maybe?

author

Hey Patrick,

TLDR: the LLM inference process has 2 parts: prompt processing, which is compute-bound, and decoding, which is memory-bound. If your use case is input-heavy (e.g. feeding in a very long document and asking the model to classify it as important or not), you'll be mostly compute-bound and will see a massive speedup from GPUs with high FLOPs. If, on the other hand, you're output-heavy (e.g. "Generate a 10 page article about capitalism"), you're mainly memory-bound and won't be utilizing the GPU's compute effectively.
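To make that distinction concrete, here's a rough back-of-envelope sketch in Python. The model size, bandwidth, and FLOP numbers are illustrative assumptions I'm plugging in, not measurements from the article:

# Back-of-envelope sketch with made-up but plausible numbers:
# a 7B-parameter model in fp16 on a GPU with ~1000 GB/s memory
# bandwidth and ~150 TFLOP/s of fp16 compute.

model_bytes = 7e9 * 2          # 7B params * 2 bytes (fp16) ~= 14 GB
bandwidth = 1000e9             # bytes/s the GPU can stream from VRAM
flops = 150e12                 # fp16 FLOP/s

# Decoding: every generated token must stream the whole model past the
# processor, so memory bandwidth sets the floor on time per token.
decode_ms_per_token = model_bytes / bandwidth * 1000
print(f"decode: ~{decode_ms_per_token:.0f} ms/token (memory-bound)")

# Prompt processing: roughly 2 FLOPs per parameter per input token, but all
# input tokens are processed in one parallel pass, so compute is the limit.
prompt_tokens = 4000
prefill_s = 2 * 7e9 * prompt_tokens / flops
print(f"prefill: ~{prefill_s * 1000:.0f} ms for {prompt_tokens} input tokens (compute-bound)")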

Long version:

This is a good question. Let's see an example where the user message is "What is the capital of France?"

We will apply the chat / instruction template to this message. Using the Vicuna template, the input to the model becomes:

input = \
"# User:
What is the capital of France?

# Assistant:
"

We expect the model's response to be:

output = "The capital of France is Paris."

The LLM inference process (i.e. using the model to generate text based on some input) consists of 2 parts:

1. Prompt processing. This is also known as prompt eval or the prefill phase. In this part, the model processes the input text (or, to be precise, tokens) and calculates 2 quantities for each token: the first is known as the "key" and the second as the "value".

In this part, we have access to the entire input, so we can parallelize this computation and utilize the massive compute capabilities of GPUs. As a result, the prompt processing step is mainly compute-bound (i.e. FLOPs intensive).

2. Decoding. This is also known as generation or eval. In this part, the model generates the output text one token at a time. It utilizes the "key" and "value" quantities from step 1 and does one forward pass to calculate the logits, which we then sample from. This process is serial in nature and can't be parallelized easily*. As a result, the generation step is mainly memory-bound: for each token, we have to read the entire model from memory (RAM or VRAM) and move it next to the processor (registers for a CPU, SRAM for a GPU) just to generate one token. (A toy sketch of both phases follows the footnote below.)

*There are ways to overcome the serial nature of decoding with techniques such as speculative decoding, but this is not the standard approach.
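Here's a toy numpy sketch of both phases for a single attention head. Everything in it (sizes, random weights, fake "token" embeddings) is made up for illustration; a real model stacks many such layers and feeds sampled tokens back through an embedding layer:

import numpy as np

d = 64                                      # toy hidden size
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def prefill(X):
    # Prompt processing: keys and values for ALL input tokens in one batched
    # matmul. Fully parallel across tokens -> compute-bound (FLOPs intensive).
    return X @ W_k, X @ W_v                 # the cached K and V: (n_tokens, d) each

def decode_step(x_new, K, V):
    # Decoding: one token at a time. Only the new token's q/k/v are computed
    # and the cached K/V are reused, but every step still has to read all the
    # weights from memory -> memory-bound.
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v
    K, V = np.vstack([K, k]), np.vstack([V, v])
    attn = softmax(q @ K.T / np.sqrt(d))    # attend over every token so far
    return attn @ V, K, V

# Usage: prefill the prompt once, then decode serially.
prompt = rng.standard_normal((10, d))       # 10 fake "input token" embeddings
K, V = prefill(prompt)
x = prompt[-1]
for _ in range(5):                          # "generate" 5 tokens
    # Feeding the output back in stands in for sampling and re-embedding a token.
    x, K, V = decode_step(x, K, V)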
