[EP05] vLLM Source Code Walkthrough: Livestream Notes

EP05: Prefix Caching

Livestream replay: https://www.youtube.com/watch?v=mWvqA_BNtsU

Special thanks to 月球大叔, Cheng Yihua, and Kaz for the excellent walkthrough.

📌 1. The Basics of LLM Inference with a KVCache

Input: tokens, plus the existing KV cache

Output: tokens, plus the updated KV cache

llm.inference(
    input_tokens: list[int],         # N tokens
    previous_kv_cache: list[Tensor], # KV cache of the first M tokens, M <= N
) -> output_tokens, new_kv_cache

output_tokens: # N' new tokens
new_kv_cache:  # KV cache of all N + N' tokens
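
As a toy sketch of this contract (compute_kv and the decode step below are made-up stand-ins, not a real model): with the first M tokens already cached, only the uncached suffix needs a forward pass.

from typing import List, Tuple

KVEntry = Tuple[int, int]  # toy stand-in for one token's K/V tensors

def compute_kv(token: int) -> KVEntry:
    # Toy stand-in for a transformer layer's K/V projection.
    return (token * 2, token * 3)

def inference(
    input_tokens: List[int],
    previous_kv_cache: List[KVEntry],
) -> Tuple[List[int], List[KVEntry]]:
    m = len(previous_kv_cache)  # M tokens already cached, M <= N
    assert m <= len(input_tokens)
    # Only the uncached suffix needs a forward pass to produce new KV.
    new_kv = [compute_kv(t) for t in input_tokens[m:]]
    kv_cache = previous_kv_cache + new_kv
    # Toy "decode" step: emit one new token derived from the cache.
    output_tokens = [kv_cache[-1][0] % 100]
    return output_tokens, kv_cache

out1, cache1 = inference([1, 2, 3], [])
out2, cache2 = inference([1, 2, 3, 4], cache1)  # reuses 3 cached entries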

Setting aside the concrete implementation details of mainstream frameworks such as vLLM and SGLang, it is worth thinking about: what is special about managing a KVCache, and how would you implement it?

⚡ 2. Prefix Matching

Storing and retrieving KVCache is essentially the same problem as a traditional key-value database such as Redis: Key: [tokens], Value: [KV cache tensors].

The basic functionality is storing and retrieving KV, which can be abstracted into the following two methods:

class KVCacheStore:
    def store(self, tokens, kv_cache_tensors):
        pass

    def retrieve(self, tokens) -> "kv_cache_tensors":
        pass

Unlike an ordinary key-value database, a KVCache store needs prefix matching:

  • Suppose we have Tokens 1 and Tokens 2:

    Tokens 1: ABCDE -> [KV1, KV2, KV3, KV4, KV5]

    Tokens 2: ABCDF -> [KV1, KV2, KV3, KV4, KV6]

  • Store the KV for Tokens 1:

    kv_cache_store.store("ABCDE", [KV1, KV2, KV3, KV4, KV5])

  • Retrieve the KV for Tokens 2:

    kv_cache_store.retrieve("ABCDF") -> [KV1, KV2, KV3, KV4]

    Since the last token differs, only the KV for the shared prefix can be returned.

  • A Trie, also known as a prefix tree or dictionary tree, is an efficient tree
    data structure commonly used for storing and looking up sets of strings.
    If each Trie node represents one token, the path from the root to a node
    represents a token sequence, which makes the Trie a natural fit for prefix
    matching. A minimal sketch follows this list.
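
Here is a minimal sketch of that idea in Python (toy code for illustration, not any framework's actual implementation):

class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.kv = None      # KV tensor for the token at this node

class KVTrie:
    """Toy token-level trie: a root-to-node path is a token sequence."""

    def __init__(self):
        self.root = TrieNode()

    def store(self, tokens, kv_tensors):
        node = self.root
        for token, kv in zip(tokens, kv_tensors):
            node = node.children.setdefault(token, TrieNode())
            node.kv = kv

    def retrieve(self, tokens):
        # Walk down as far as the trie matches: the longest common prefix.
        node, hit = self.root, []
        for token in tokens:
            if token not in node.children:
                break
            node = node.children[token]
            hit.append(node.kv)
        return hit

trie = KVTrie()
trie.store("ABCDE", ["KV1", "KV2", "KV3", "KV4", "KV5"])
print(trie.retrieve("ABCDF"))  # ['KV1', 'KV2', 'KV3', 'KV4']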

💡 3. The Design of KVCache in vLLM

  • First, the tokens are split into chunks of chunk_size. Suppose chunk_size=2:

    "ABCDEF" -> "AB", "CD", "EF" -> list of chunked prefix hashes

    The chunk size is a trade-off: if it is too large, matching precision is low;
    if it is too small, I/O efficiency suffers. In vLLM, chunk_size defaults to 16;
    in SGLang it can be 1.

    (The chunk_size here corresponds to block_size in the vLLM code; do not
    confuse it with the chunk_size used in chunked prefill.)

  • Next, the chunks are hashed. The key point is that each chunk's hash also
    encodes the hash of everything before it, which simulates a prefix tree:

    prefix_hash = ""
    for chunk in chunked_tokens:  # ["AB", "CD", "EF"]
        chunk_hash = hash(prefix_hash + chunk)
        prefix_hash = chunk_hash
  • Then the basic store/retrieve implementation:

    # Given chunked prefix hashes and the per-chunk KV cache
    # store
    for chunk_hash, chunk_kv in zip(...):
        redis.put(chunk_hash, chunk_kv)

    # retrieve
    for chunk_hash in ...:
        kv_chunk = redis.get(chunk_hash)
        if kv_chunk is None:
            break
  • Eviction

    As inference proceeds, memory eventually runs out and some KV must be evicted.

    Eviction rule: evict a request's chunks from back to front, so that the
    prefix cache stays reusable for as long as possible.

    • "ABCDEF" -> ["AB", KV1], ["CD", KV2], ["EF", KV3]
    • Evict in the order KV3, KV2, KV1, i.e. from back to front

    Standard policies such as LRU or LFU can be used. A combined runnable
    sketch of chunking, hashing, store/retrieve, and eviction follows this list.
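
Putting the three steps together, here is a runnable toy version. An in-memory OrderedDict stands in for Redis; put/get/evict_one and BLOCK_SIZE are made-up names, and refreshing recency back-to-front is just one simple way to make LRU respect the back-to-front eviction rule:

from collections import OrderedDict

BLOCK_SIZE = 2  # the chunk_size above; vLLM's block_size defaults to 16

def chunk_hashes(tokens: str) -> list:
    """Chain each chunk's hash with its prefix hash (simulated prefix tree)."""
    hashes, prefix_hash = [], ""
    n_full = len(tokens) - len(tokens) % BLOCK_SIZE  # full chunks only
    for i in range(0, n_full, BLOCK_SIZE):
        prefix_hash = hash((prefix_hash, tokens[i:i + BLOCK_SIZE]))
        hashes.append(prefix_hash)
    return hashes

store = OrderedDict()  # in-memory stand-in for Redis; order tracks recency

def _touch(hashes: list) -> None:
    # Refresh recency back-to-front: within one request, the tail chunk
    # stays least recently used, so it is evicted first and the shared
    # prefix survives the longest.
    for h in reversed(hashes):
        if h in store:
            store.move_to_end(h)

def put(tokens: str, kv_chunks: list) -> None:
    hashes = chunk_hashes(tokens)
    for h, kv in zip(hashes, kv_chunks):
        store[h] = kv
    _touch(hashes)

def get(tokens: str) -> list:
    hashes, hit = chunk_hashes(tokens), []
    for h in hashes:
        if h not in store:
            break  # the prefix match ends at the first miss
        hit.append(store[h])
    _touch(hashes[:len(hit)])
    return hit

def evict_one() -> None:
    store.popitem(last=False)  # LRU: drop the least recently used chunk

put("ABCDEF", ["KV_AB", "KV_CD", "KV_EF"])
print(get("ABCDXY"))  # ['KV_AB', 'KV_CD']: only the shared prefix hits
evict_one()           # evicts KV_EF first (back to front)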

🔎 4. The Concrete Implementation of KVCache in vLLM

Retrieve

"vllm/v1/core/sched/scheduler.py"

# Get already-cached tokens.
computed_blocks, num_computed_tokens = \
    self.kv_cache_manager.get_computed_blocks(request)

The definition of get_computed_blocks:

def get_computed_blocks(
        self, request: Request) -> tuple[list[KVCacheBlock], int]:
    ...
    # The block hashes for the request may already be computed
    # if the scheduler has tried to schedule the request before.
    block_hashes = self.req_to_block_hashes[request.request_id]
    if not block_hashes:
        # Compute the hash of each block.
        block_hashes = hash_request_tokens(self.caching_hash_fn,
                                           self.block_size, request)
        self.req_to_block_hashes[request.request_id] = block_hashes
    ...
    # Find the longest prefix match.
    computed_blocks = (
        self.specialized_manager.find_longest_cache_hit(block_hashes))
    self.prefix_cache_stats.queries += len(block_hashes)
    self.prefix_cache_stats.hits += len(computed_blocks)
    ...
    return computed_blocks, num_computed_tokens

Store

"vllm/v1/core/block_pool.py"

The "BlockPool" class defines the methods for managing KVCacheBlocks.

# Given a request and its computed blocks, store them in the BlockPool.
def cache_full_blocks(
    self,
    request: Request,
    blocks: list[KVCacheBlock],
    block_hashes: list[BlockHashType],
    num_cached_blocks: int,
    num_full_blocks: int,
    block_size: int,
    hash_fn: Callable,
) -> None:
    """Cache a list of full blocks for prefix caching.
    This function takes a list of blocks that will have their block hash
    metadata to be updated and cached. Given a request, it computes the
    block hashes for the blocks starting from `num_cached_blocks` to
    `num_full_blocks`, updating the metadata for each block
    and caching them in the `cached_block_hash_to_block`.
    """
    ...
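
Conceptually, the index being populated here maps a block hash to the physical blocks that hold that content (the same structure the eviction code below indexes into). A simplified sketch of what the docstring describes; cache_full_blocks_sketch and Block are hypothetical stand-ins, not vLLM code:

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Block:  # toy stand-in for KVCacheBlock
    block_id: int
    block_hash: object = None

# One hash can map to several physical blocks holding identical
# content, keyed by block_id.
cached_block_hash_to_block = defaultdict(dict)

def cache_full_blocks_sketch(blocks, block_hashes,
                             num_cached_blocks, num_full_blocks):
    # Register blocks in [num_cached_blocks, num_full_blocks) under
    # their hashes so later requests can hit them by prefix hash.
    for blk, h in zip(blocks[num_cached_blocks:num_full_blocks],
                      block_hashes[num_cached_blocks:num_full_blocks]):
        blk.block_hash = h
        cached_block_hash_to_block[h][blk.block_id] = blk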

Eviction

Eviction happens in the get_new_blocks function of the "BlockPool" class:

def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
    ...
    curr_block = self.free_block_queue.popleft()
    ...
    # With enable_caching on, if there are no free blocks left,
    # old blocks must be evicted.
    # If the block is cached, evict it.
    if self.enable_caching:
        self._maybe_evict_cached_block(curr_block)

    curr_block.incr_ref()
    ret.append(curr_block)
    idx += 1

    return ret

def _maybe_evict_cached_block(self, block: KVCacheBlock) -> bool:
    """
    If a block is cached in `cached_block_hash_to_block`, we reset its hash
    metadata and evict it from the cache.

    Args:
        block: The block to evict.

    Returns:
        True if the block is evicted, False otherwise.
    """
    block_hash = block.block_hash
    if block_hash and block_hash in self.cached_block_hash_to_block:
        block.reset_hash()
        del self.cached_block_hash_to_block[block_hash][block.block_id]

        if len(self.cached_block_hash_to_block[block_hash]) == 0:
            del self.cached_block_hash_to_block[block_hash]

        return True
    return False

free_block_queue.popleft() lives in the FreeKVCacheBlockQueue class in "vllm/v1/core/kv_cache_utils", which uses a doubly linked list to implement the LRU order (a classic LeetCode problem):

class FreeKVCacheBlockQueue:
    ...
    def popleft(self) -> KVCacheBlock:
        ...
    def remove(self, block: KVCacheBlock) -> None:
        ...
    def append(self, block: KVCacheBlock) -> None:
        ...
    def get_all_free_blocks(self) -> list[KVCacheBlock]:
        ...
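
To see why a doubly linked list fits here (a toy sketch, not vLLM's actual class): popleft evicts from the least-recently-used head, append returns freed blocks to the tail, and remove unlinks a block that got a prefix-cache hit, all in O(1):

class Node:
    __slots__ = ("block", "prev", "next")
    def __init__(self, block=None):
        self.block, self.prev, self.next = block, None, None

class FreeQueue:
    """Toy doubly linked list: O(1) popleft / append / remove."""

    def __init__(self):
        self.head = Node()  # sentinel before the LRU end
        self.tail = Node()  # sentinel after the MRU end
        self.head.next, self.tail.prev = self.tail, self.head

    def append(self, node: Node) -> None:
        # A freed block goes to the tail, so it is evicted last.
        last = self.tail.prev
        last.next = node
        node.prev, node.next = last, self.tail
        self.tail.prev = node

    def popleft(self) -> Node:
        # The head holds the least recently freed block: evict it first.
        node = self.head.next
        assert node is not self.tail, "no free blocks"
        self.remove(node)
        return node

    def remove(self, node: Node) -> None:
        # Unlink in O(1) when a free block gets a cache hit and is reused.
        node.prev.next, node.next.prev = node.next, node.prev

q = FreeQueue()
a, b = Node("blk-a"), Node("blk-b")
q.append(a); q.append(b)
assert q.popleft() is a  # the LRU end comes out first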

🔋 5. KVCache-Aware Multi-Instance Routing

Option 1

Use string matching rather than matching on token IDs (a sketch follows the bullets below).

  • Tokenization itself is fairly slow (it takes several microseconds), so running it on every request would add a lot of overhead.
  • Implement a string server (e.g., using Redis) as the storage backend.
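
A rough sketch of this idea (all names hypothetical; a real router would track prefixes incrementally rather than scanning lists):

def pick_server(prompt: str, server_prefixes: dict) -> str:
    """server_prefixes: server name -> list of cached prompt prefixes."""
    def longest_match(prefixes):
        return max((len(p) for p in prefixes if prompt.startswith(p)),
                   default=0)
    # Prefer the server with the longest cached string prefix.
    return max(server_prefixes,
               key=lambda s: longest_match(server_prefixes[s]))

servers = {
    "vllm-0": ["You are a helpful assistant."],
    "vllm-1": ["You are a helpful assistant. Answer in JSON."],
}
print(pick_server("You are a helpful assistant. Answer in JSON. Hi!",
                  servers))  # -> vllm-1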

Option 2

The router can ask the KV cache management system: which server holds the longest matching prefix?
The vLLM production stack team's solution:
https://github.com/vllm-project/production-stack/issues/59

KVCache-Aware Routing vs. Load Balancing

There is a trade-off here: KV-cache-aware routing tends to send requests to the same node, which is bad for load balancing.