[EP05] vLLM Source Code Walkthrough: Livestream Notes

EP05: Prefix Caching

Livestream replay: https://www.youtube.com/watch?v=mWvqA_BNtsU

Special thanks to 月球大叔, Cheng Yihua, and Kaz for the excellent walkthrough.

📌 1. The Basics of LLM Inference with a KVCache

Input: tokens, plus the existing KV cache

Output: tokens, plus the updated KV cache

llm.inference(
    input_tokens: list[int],         # N tokens
    previous_kv_cache: list[Tensor], # KV cache of the first M tokens, M <= N
) -> output_tokens, new_kv_cache

output_tokens: # N' new tokens
new_kv_cache:  # KV cache of all N + N' tokens
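
As a toy sketch of this contract (compute_kv and the decode step below are made-up stand-ins, not a real model): with the first M tokens already cached, only the uncached suffix needs a forward pass.

from typing import List, Tuple

KVEntry = Tuple[int, int]  # toy stand-in for one token's K/V tensors

def compute_kv(token: int) -> KVEntry:
    # Toy stand-in for a transformer layer's K/V projection.
    return (token * 2, token * 3)

def inference(
    input_tokens: List[int],
    previous_kv_cache: List[KVEntry],
) -> Tuple[List[int], List[KVEntry]]:
    m = len(previous_kv_cache)  # M tokens already cached, M <= N
    assert m <= len(input_tokens)
    # Only the uncached suffix needs a forward pass to produce new KV.
    new_kv = [compute_kv(t) for t in input_tokens[m:]]
    kv_cache = previous_kv_cache + new_kv
    # Toy "decode" step: emit one new token derived from the cache.
    output_tokens = [kv_cache[-1][0] % 100]
    return output_tokens, kv_cache

out1, cache1 = inference([1, 2, 3], [])
out2, cache2 = inference([1, 2, 3, 4], cache1)  # reuses 3 cached entries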

Setting aside the concrete implementation details of mainstream frameworks such as vLLM and SGLang, it is worth thinking about: what is special about managing a KVCache, and how would you implement it?

⚡ 2. Prefix Matching

Storing and retrieving KVCache is essentially the same problem as a traditional key-value database such as Redis: Key: [tokens], Value: [KV cache tensors].

The basic functionality is storing and retrieving KV, which can be abstracted into the following two methods:

class KVCacheStore:
    def store(self, tokens, kv_cache_tensors):
        pass

    def retrieve(self, tokens) -> "kv_cache_tensors":
        pass

Unlike an ordinary key-value database, a KVCache store needs prefix matching:

  • Suppose we have Tokens 1 and Tokens 2:

    Tokens 1: ABCDE -> [KV1, KV2, KV3, KV4, KV5]

    Tokens 2: ABCDF -> [KV1, KV2, KV3, KV4, KV6]

  • Store the KV for Tokens 1:

    kv_cache_store.store("ABCDE", [KV1, KV2, KV3, KV4, KV5])

  • Retrieve the KV for Tokens 2:

    kv_cache_store.retrieve("ABCDF") -> [KV1, KV2, KV3, KV4]

    Since the last token differs, only the KV for the shared prefix can be returned.

  • A Trie, also known as a prefix tree or dictionary tree, is an efficient tree
    data structure commonly used for storing and looking up sets of strings.
    If each Trie node represents one token, the path from the root to a node
    represents a token sequence, which makes the Trie a natural fit for prefix
    matching. A minimal sketch follows this list.
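
Here is a minimal sketch of that idea in Python (toy code for illustration, not any framework's actual implementation):

class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.kv = None      # KV tensor for the token at this node

class KVTrie:
    """Toy token-level trie: a root-to-node path is a token sequence."""

    def __init__(self):
        self.root = TrieNode()

    def store(self, tokens, kv_tensors):
        node = self.root
        for token, kv in zip(tokens, kv_tensors):
            node = node.children.setdefault(token, TrieNode())
            node.kv = kv

    def retrieve(self, tokens):
        # Walk down as far as the trie matches: the longest common prefix.
        node, hit = self.root, []
        for token in tokens:
            if token not in node.children:
                break
            node = node.children[token]
            hit.append(node.kv)
        return hit

trie = KVTrie()
trie.store("ABCDE", ["KV1", "KV2", "KV3", "KV4", "KV5"])
print(trie.retrieve("ABCDF"))  # ['KV1', 'KV2', 'KV3', 'KV4']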

💡 3. The Design of KVCache in vLLM

  • First, the tokens are split into chunks of chunk_size. Suppose chunk_size=2:

    "ABCDEF" -> "AB", "CD", "EF" -> list of chunked prefix hashes

    The chunk size is a trade-off: if it is too large, matching precision is low;
    if it is too small, I/O efficiency suffers. In vLLM, chunk_size defaults to 16;
    in SGLang it can be 1.

    (The chunk_size here corresponds to block_size in the vLLM code; do not
    confuse it with the chunk_size used in chunked prefill.)

  • Next, the chunks are hashed. The key point is that each chunk's hash also
    encodes the hash of everything before it, which simulates a prefix tree:

    prefix_hash = ""
    for chunk in chunked_tokens:  # ["AB", "CD", "EF"]
        chunk_hash = hash(prefix_hash + chunk)
        prefix_hash = chunk_hash
  • Then the basic store/retrieve implementation:

    # Given chunked prefix hashes and the per-chunk KV cache
    # store
    for chunk_hash, chunk_kv in zip(...):
        redis.put(chunk_hash, chunk_kv)

    # retrieve
    for chunk_hash in ...:
        kv_chunk = redis.get(chunk_hash)
        if kv_chunk is None:
            break
  • Eviction

    As inference proceeds, memory eventually runs out and some KV must be evicted.

    Eviction rule: evict a request's chunks from back to front, so that the
    prefix cache stays reusable for as long as possible.

    • "ABCDEF" -> ["AB", KV1], ["CD", KV2], ["EF", KV3]
    • Evict in the order KV3, KV2, KV1, i.e. from back to front

    Standard policies such as LRU or LFU can be used. A combined runnable
    sketch of chunking, hashing, store/retrieve, and eviction follows this list.
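
Putting the three steps together, here is a runnable toy version. An in-memory OrderedDict stands in for Redis; put/get/evict_one and BLOCK_SIZE are made-up names, and refreshing recency back-to-front is just one simple way to make LRU respect the back-to-front eviction rule:

from collections import OrderedDict

BLOCK_SIZE = 2  # the chunk_size above; vLLM's block_size defaults to 16

def chunk_hashes(tokens: str) -> list:
    """Chain each chunk's hash with its prefix hash (simulated prefix tree)."""
    hashes, prefix_hash = [], ""
    n_full = len(tokens) - len(tokens) % BLOCK_SIZE  # full chunks only
    for i in range(0, n_full, BLOCK_SIZE):
        prefix_hash = hash((prefix_hash, tokens[i:i + BLOCK_SIZE]))
        hashes.append(prefix_hash)
    return hashes

store = OrderedDict()  # in-memory stand-in for Redis; order tracks recency

def _touch(hashes: list) -> None:
    # Refresh recency back-to-front: within one request, the tail chunk
    # stays least recently used, so it is evicted first and the shared
    # prefix survives the longest.
    for h in reversed(hashes):
        if h in store:
            store.move_to_end(h)

def put(tokens: str, kv_chunks: list) -> None:
    hashes = chunk_hashes(tokens)
    for h, kv in zip(hashes, kv_chunks):
        store[h] = kv
    _touch(hashes)

def get(tokens: str) -> list:
    hashes, hit = chunk_hashes(tokens), []
    for h in hashes:
        if h not in store:
            break  # the prefix match ends at the first miss
        hit.append(store[h])
    _touch(hashes[:len(hit)])
    return hit

def evict_one() -> None:
    store.popitem(last=False)  # LRU: drop the least recently used chunk

put("ABCDEF", ["KV_AB", "KV_CD", "KV_EF"])
print(get("ABCDXY"))  # ['KV_AB', 'KV_CD']: only the shared prefix hits
evict_one()           # evicts KV_EF first (back to front)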

🔎 4. The Concrete Implementation of KVCache in vLLM

Retrieve

"vllm/v1/core/sched/scheduler.py"

# Get already-cached tokens.
computed_blocks, num_computed_tokens = \
    self.kv_cache_manager.get_computed_blocks(request)

The definition of get_computed_blocks:

def get_computed_blocks(
        self, request: Request) -> tuple[list[KVCacheBlock], int]:
    ...
    # The block hashes for the request may already be computed
    # if the scheduler has tried to schedule the request before.
    block_hashes = self.req_to_block_hashes[request.request_id]
    if not block_hashes:
        # Compute the hash of each block.
        block_hashes = hash_request_tokens(self.caching_hash_fn,
                                           self.block_size, request)
        self.req_to_block_hashes[request.request_id] = block_hashes
    ...
    # Find the longest prefix match.
    computed_blocks = (
        self.specialized_manager.find_longest_cache_hit(block_hashes))
    self.prefix_cache_stats.queries += len(block_hashes)
    self.prefix_cache_stats.hits += len(computed_blocks)
    ...
    return computed_blocks, num_computed_tokens

Store

"vllm/v1/core/block_pool.py"

The "BlockPool" class defines the methods for managing KVCacheBlocks.

# Given a request and its computed blocks, store them in the BlockPool.
def cache_full_blocks(
    self,
    request: Request,
    blocks: list[KVCacheBlock],
    block_hashes: list[BlockHashType],
    num_cached_blocks: int,
    num_full_blocks: int,
    block_size: int,
    hash_fn: Callable,
) -> None:
    """Cache a list of full blocks for prefix caching.
    This function takes a list of blocks that will have their block hash
    metadata to be updated and cached. Given a request, it computes the
    block hashes for the blocks starting from `num_cached_blocks` to
    `num_full_blocks`, updating the metadata for each block
    and caching them in the `cached_block_hash_to_block`.
    """
    ...
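
Conceptually, the index being populated here maps a block hash to the physical blocks that hold that content (the same structure the eviction code below indexes into). A simplified sketch of what the docstring describes; cache_full_blocks_sketch and Block are hypothetical stand-ins, not vLLM code:

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Block:  # toy stand-in for KVCacheBlock
    block_id: int
    block_hash: object = None

# One hash can map to several physical blocks holding identical
# content, keyed by block_id.
cached_block_hash_to_block = defaultdict(dict)

def cache_full_blocks_sketch(blocks, block_hashes,
                             num_cached_blocks, num_full_blocks):
    # Register blocks in [num_cached_blocks, num_full_blocks) under
    # their hashes so later requests can hit them by prefix hash.
    for blk, h in zip(blocks[num_cached_blocks:num_full_blocks],
                      block_hashes[num_cached_blocks:num_full_blocks]):
        blk.block_hash = h
        cached_block_hash_to_block[h][blk.block_id] = blk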

Eviction

Eviction happens in the get_new_blocks function of the "BlockPool" class:

def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
    ...
    curr_block = self.free_block_queue.popleft()
    ...
    # With enable_caching on, if there are no free blocks left,
    # old blocks must be evicted.
    # If the block is cached, evict it.
    if self.enable_caching:
        self._maybe_evict_cached_block(curr_block)

    curr_block.incr_ref()
    ret.append(curr_block)
    idx += 1

    return ret

def _maybe_evict_cached_block(self, block: KVCacheBlock) -> bool:
    """
    If a block is cached in `cached_block_hash_to_block`, we reset its hash
    metadata and evict it from the cache.

    Args:
        block: The block to evict.

    Returns:
        True if the block is evicted, False otherwise.
    """
    block_hash = block.block_hash
    if block_hash and block_hash in self.cached_block_hash_to_block:
        block.reset_hash()
        del self.cached_block_hash_to_block[block_hash][block.block_id]

        if len(self.cached_block_hash_to_block[block_hash]) == 0:
            del self.cached_block_hash_to_block[block_hash]

        return True
    return False

free_block_queue.popleft() lives in the FreeKVCacheBlockQueue class in "vllm/v1/core/kv_cache_utils", which uses a doubly linked list to implement the LRU order (a classic LeetCode problem):

class FreeKVCacheBlockQueue:
    ...
    def popleft(self) -> KVCacheBlock:
        ...
    def remove(self, block: KVCacheBlock) -> None:
        ...
    def append(self, block: KVCacheBlock) -> None:
        ...
    def get_all_free_blocks(self) -> list[KVCacheBlock]:
        ...
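
To see why a doubly linked list fits here (a toy sketch, not vLLM's actual class): popleft evicts from the least-recently-used head, append returns freed blocks to the tail, and remove unlinks a block that got a prefix-cache hit, all in O(1):

class Node:
    __slots__ = ("block", "prev", "next")
    def __init__(self, block=None):
        self.block, self.prev, self.next = block, None, None

class FreeQueue:
    """Toy doubly linked list: O(1) popleft / append / remove."""

    def __init__(self):
        self.head = Node()  # sentinel before the LRU end
        self.tail = Node()  # sentinel after the MRU end
        self.head.next, self.tail.prev = self.tail, self.head

    def append(self, node: Node) -> None:
        # A freed block goes to the tail, so it is evicted last.
        last = self.tail.prev
        last.next = node
        node.prev, node.next = last, self.tail
        self.tail.prev = node

    def popleft(self) -> Node:
        # The head holds the least recently freed block: evict it first.
        node = self.head.next
        assert node is not self.tail, "no free blocks"
        self.remove(node)
        return node

    def remove(self, node: Node) -> None:
        # Unlink in O(1) when a free block gets a cache hit and is reused.
        node.prev.next, node.next.prev = node.next, node.prev

q = FreeQueue()
a, b = Node("blk-a"), Node("blk-b")
q.append(a); q.append(b)
assert q.popleft() is a  # the LRU end comes out first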

🔋 5. KVCache-Aware Multi-Instance Routing

Option 1

Use string matching rather than matching on token IDs (a sketch follows the bullets below).

  • Tokenization itself is fairly slow (it takes several microseconds), so running it on every request would add a lot of overhead.
  • Implement a string server (e.g., using Redis) as the storage backend.
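
A rough sketch of this idea (all names hypothetical; a real router would track prefixes incrementally rather than scanning lists):

def pick_server(prompt: str, server_prefixes: dict) -> str:
    """server_prefixes: server name -> list of cached prompt prefixes."""
    def longest_match(prefixes):
        return max((len(p) for p in prefixes if prompt.startswith(p)),
                   default=0)
    # Prefer the server with the longest cached string prefix.
    return max(server_prefixes,
               key=lambda s: longest_match(server_prefixes[s]))

servers = {
    "vllm-0": ["You are a helpful assistant."],
    "vllm-1": ["You are a helpful assistant. Answer in JSON."],
}
print(pick_server("You are a helpful assistant. Answer in JSON. Hi!",
                  servers))  # -> vllm-1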

Option 2

The router can ask the KV cache management system: which server holds the longest matching prefix?
The vLLM production stack team's solution:
https://github.com/vllm-project/production-stack/issues/59

KVCache-Aware Routing vs. Load Balancing

There is a trade-off here: KV-cache-aware routing tends to send requests to the same node, which is bad for load balancing.