Add XPU support to LMCache for CPU offloading#1
Conversation
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Intel XPUs, which is a significant enhancement for platform compatibility. The changes primarily involve adding a Platform utility class for hardware detection and making various parts of the codebase platform-agnostic by replacing CUDA-specific calls with conditional logic. A new VLLMPagedMemXPUConnectorV2 is also added.
The overall approach is solid, but I've found a few critical issues where the optional import of lmcache.c_ops can lead to runtime errors on non-CUDA platforms, as the code doesn't handle the case where the import fails. I've also included some suggestions to improve code maintainability by reducing duplication and cleaning up comments.
| """Expect a kwarg 'kvcaches' which is a nested tuple of K and V tensors. | ||
| The kvcaches should correspond to the "WHOLE token sequence". | ||
|
|
||
| Note: | ||
| 1. This function expects the 'slot_mapping' is a "full slot mapping" | ||
| where it's length is the same as the whole token sequence. | ||
| 2. In the case that there is prefix caching, slot_mapping will starts | ||
| with -1s until the end of the matched prefix. The start and end | ||
| should NEVER overlap with the prefix caching (which means the | ||
| underlying CUDA kernel will never see -1 in slot_mapping) | ||
|
|
||
|
|
||
| :raises ValueError: If 'kvcaches' is not provided in kwargs. | ||
| :raises AssertionError: If the memory object does not have a tensor. | ||
| :raises ValueError: If 'slot_mapping' is not provided in kwargs. | ||
| """ |
There was a problem hiding this comment.
The docstring for to_gpu mentions "CUDA kernel" on line 121. Since this connector is for XPU, this is misleading. Please update the comment to be platform-agnostic. For example, you could replace "CUDA kernel" with "computation kernel" or "underlying kernel". The same issue exists in the docstring for from_gpu on line 177.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Intel XPUs by adding a Platform utility to abstract device-specific logic. It includes a new VLLMPagedMemXPUConnectorV2 for XPU integration and updates various parts of the codebase to be device-agnostic. The changes are well-structured. However, there are critical issues where the optional import of CUDA-specific C operations (lmc_ops) is not handled correctly, which will lead to crashes on non-CUDA platforms like XPU when certain features (e.g., cachegen serde) are used. There are also some minor code cleanup opportunities in the new XPU connector.
44807c9 to
c65bb1f
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request adds XPU support to LMCache, primarily by introducing a Platform utility for device detection and implementing a new VLLMPagedMemXPUConnectorV2 for XPU devices. The changes are well-structured to abstract away device-specific logic. However, I've found a critical issue with stream initialization for XPU that could lead to a runtime error. Additionally, there are opportunities to improve maintainability by reducing code duplication and to enhance clarity by correcting docstrings and error messages in the new XPU connector. Addressing these points will strengthen the XPU integration.
a1f5f64 to
86ddb33
Compare
[bugfix] fix memory lead in fs connector bugfix Signed-off-by: idellzheng <idellzheng@tencent.com>
* add flash infer Signed-off-by: Samuel Shen <slshen@uchicago.edu> * missed one Signed-off-by: Samuel Shen <slshen@uchicago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchicago.edu> Co-authored-by: Samuel Shen <slshen@uchicago.edu>
86ddb33 to
e373da8
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Intel XPUs to LMCache, which is a valuable addition for enabling CPU offloading on different hardware. The changes primarily involve abstracting device-specific code (CUDA vs. XPU) and adding a new VLLMPagedMemXPUConnectorV2 for XPU devices. The overall approach is solid, and the device-agnostic changes like replacing .cuda() with .to(device) are well-executed. My review focuses on improving maintainability by addressing code duplication, fixing minor documentation errors, and highlighting potential performance and correctness issues in the new XPU connector.
|
|
||
| # TODO(Jiayi): need to optimize to enable real batching | ||
| def batched_to_gpu(self, memory_objs, starts, ends, **kwargs): | ||
| for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False): |
There was a problem hiding this comment.
Using zip with strict=False can hide potential bugs if the input iterables (memory_objs, starts, ends) have mismatched lengths, as it will silently truncate to the shortest one. It is safer to use strict=True (available in Python 3.10+) to raise a ValueError in such cases. If compatibility with older Python versions is required, adding an explicit length check before the loop would be a robust alternative.
| for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False): | |
| for memory_obj, start, end in zip(memory_objs, starts, ends, strict=True): |
| if torch.cuda.is_available(): | ||
| torch_dev = torch.cuda | ||
| dev_name = "cuda" | ||
| elif torch.xpu.is_available(): | ||
| torch_dev = torch.xpu | ||
| dev_name = "xpu" | ||
| else: | ||
| raise RuntimeError("Unsupported device platform for LMCache engine.") |
There was a problem hiding this comment.
The logic to detect the available device (cuda or xpu) is duplicated in this function, both here and at lines 536-541. This could be refactored into a helper function to improve maintainability and reduce code duplication. For example, a function could return the device module and name, which can then be used in both places.
| logger = init_logger(__name__) | ||
|
|
||
|
|
||
| class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2): |
There was a problem hiding this comment.
VLLMPagedMemXPUConnectorV2 inherits from VLLMPagedMemGPUConnectorV2, but it seems to override most of the core logic (__init__, to_gpu, from_gpu) and uses a different implementation strategy (pure PyTorch vs. custom CUDA kernels). The parent class also contains CUDA-specific code that makes direct inheritance for other devices problematic. This can lead to maintenance challenges.
A cleaner design might be for VLLMPagedMemXPUConnectorV2 to inherit directly from GPUConnectorInterface. Any logic that is truly common between the GPU and XPU connectors could be extracted into a shared base class or utility functions.
| 2. In the case that there is prefix caching, slot_mapping will starts | ||
| with -1s until the end of the matched prefix. The start and end | ||
| should NEVER overlap with the prefix caching (which means the | ||
| underlying CUDA kernel will never see -1 in slot_mapping) |
There was a problem hiding this comment.
The docstring mentions a CUDA kernel, which appears to be a copy-paste artifact from the GPU connector. To maintain accuracy and clarity, this should be updated to be device-agnostic, for example, by simply referring to it as the underlying kernel.
| underlying CUDA kernel will never see -1 in slot_mapping) | |
| should NEVER overlap with the prefix caching (which means the | |
| underlying kernel will never see -1 in slot_mapping) |
| for i, kvcache in enumerate(kvcaches): | ||
| kvcache.view(total_blocks, head_size).index_copy_(0, slices, tmp[i]) | ||
| else: | ||
| tmp_k = memory_obj.tensor[0].to(slot_mapping.device) | ||
| tmp_v = memory_obj.tensor[1].to(slot_mapping.device) | ||
| num_blocks, block_size, num_heads, head_size = kvcaches[0][0].shape | ||
| total_blocks = num_blocks * block_size | ||
| d = num_heads * head_size | ||
| for i, (kcache, vcache) in enumerate(kvcaches): | ||
| kcache.view(total_blocks, d).index_copy_(0, slices, tmp_k[i]) | ||
| vcache.view(total_blocks, d).index_copy_(0, slices, tmp_v[i]) |
There was a problem hiding this comment.
The explicit Python loop over layers to perform index_copy_ can introduce performance overhead, especially for models with a large number of layers. This is in contrast to the GPU connector which uses a single fused CUDA kernel for this operation. While this pure-PyTorch implementation is a great way to support XPU without custom kernels, it's worth noting this potential performance bottleneck. If performance becomes an issue, creating a batched version of this operation or a custom XPU kernel could be considered for optimization.
| 2. In the case that there is prefix caching, slot_mapping will starts | ||
| with -1s until the end of the matched prefix. The start and end | ||
| should NEVER overlap with the prefix caching (which means the | ||
| underlying CUDA kernel will never see -1 in slot_mapping) |
There was a problem hiding this comment.
Similar to a previous comment, the docstring mentions a CUDA kernel. This should be updated to be device-agnostic to avoid confusion.
| underlying CUDA kernel will never see -1 in slot_mapping) | |
| should NEVER overlap with the prefix caching (which means the | |
| underlying kernel will never see -1 in slot_mapping) |
| tmp = torch.stack( | ||
| [ | ||
| kvcache.view(total_blocks, head_size).index_select(0, slices) | ||
| for kvcache in kvcaches | ||
| ] | ||
| ) | ||
| else: | ||
| num_blocks, block_size, num_heads, head_size = kvcaches[0][0].shape | ||
| total_blocks = num_blocks * block_size | ||
| d = num_heads * head_size | ||
| tmp_k = torch.stack( | ||
| [ | ||
| kvcache[0].view(total_blocks, d).index_select(0, slices) | ||
| for kvcache in kvcaches | ||
| ] | ||
| ) | ||
| tmp_v = torch.stack( | ||
| [ | ||
| kvcache[1].view(total_blocks, d).index_select(0, slices) | ||
| for kvcache in kvcaches | ||
| ] | ||
| ) | ||
| tmp = torch.stack([tmp_k, tmp_v]) |
There was a problem hiding this comment.
Similar to the to_gpu method, the use of list comprehensions here to iterate over layers ([... for kvcache in kvcaches]) effectively creates a Python loop. This can be a performance bottleneck compared to a fused kernel, especially for models with many layers. It's a good idea to be aware of this potential performance impact.
* initial Signed-off-by: Samuel Shen <slshen@uchicago.edu> * finished documenationgit branch Signed-off-by: Samuel Shen <slshen@uchicago.edu> * change color Signed-off-by: Samuel Shen <slshen@uchicago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchicago.edu> Co-authored-by: Samuel Shen <slshen@uchicago.edu>
…Cache#1609) add --enforce-strict-concurrent-users option to allow user count enforcement in multi-round-qa.py
fix pin_count type Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>
Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>
* fix lookup Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * fix format Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * fix bugs Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * fix small bugs Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> --------- Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Signed-off-by: idellzheng <idellzheng@tencent.com>
…connector as first step (LMCache#1696) * Implement remove api for remote backend and fs_connector as first step Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Update lmcache/v1/storage_backend/connector/fs_connector.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> * Update lmcache/v1/storage_backend/remote_backend.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> * Update lmcache/v1/storage_backend/remote_backend.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> * Fix style Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Add remove_sync method to InstrumentedRemoteConnector * Add remove_sync method for key audit logging --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com> Signed-off-by: maobaolong <307499405@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Async redis cluster remote connector + unit tests Signed-off-by: Lindsey Wen <lindwen@amazon.com> Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu> * Run pre-commit Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu> --------- Signed-off-by: Lindsey Wen <lindwen@amazon.com> Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu> Co-authored-by: Lindsey Wen <lindwen@amazon.com>
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 5.0.0. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@ea165f8...330a01c) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 5.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…1937) Follow up documentation / improvements to come. * [Core] Add SageMaker HyperPod remote connector Signed-off-by: Ziwen Ning <ningziwe@amazon.com> * Address comments Signed-off-by: Ziwen Ning <ningziwe@amazon.com> --------- Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
…okensInternalResult (LMCache#1953) Signed-off-by: baoloongmao <baoloongmao@tencent.com>
fix Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
* filter out 0 hit rate Signed-off-by: idellzheng <idellzheng@tencent.com> * rename Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* [refactor] unified reconstruct cache engine key Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * move method to CacheEngineKey Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
32f673f to
866d32b
Compare
Support get the env of current process Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Adapt hash func to vllm Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
LMCache#1827) * [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> * format code with ruff Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> * [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> --------- Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> Co-authored-by: 董佩琪 <dong.peiqi@neolink.com>
…_address (LMCache#1971) LMCACHE_REMOTE_URL overrides master_server_address even when explicitly configured (mooncake HA mode issue) Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> Co-authored-by: 董佩琪 <dong.peiqi@neolink.com>
…e#1965) * [bugfix] do not start lmcache worker on scheduler Signed-off-by: idellzheng <idellzheng@tencent.com> * Update lmcache/v1/cache_engine.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com> Signed-off-by: maobaolong <307499405@qq.com> Co-authored-by: maobaolong <307499405@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [controller] simplify config when p2p is not enabled Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * optimize Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>
* [Model] Added support for Qwen2 Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg> * [Refactor] Remove duplicating code files for Qwen2 support Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg> --------- Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg>
Added office hours Signed-off-by: Nicolas (Nick) Barcet <nicolas@barcet.com>
…MCache#1655) feat(mooncake): add NUMA affinity and batched operations support Signed-off-by: Jinyang Su <751080330@qq.com> Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
* [optimize] optimize connection of remote backend Signed-off-by: idellzheng <idellzheng@tencent.com> * adapt remote monitor Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * do not initialize storage manager Signed-off-by: idellzheng <idellzheng@tencent.com> * remove remote backend change Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * bugfix Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>
* Introduce a dynamic expend memory allocator Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
* Add automerge-labeler.yml Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu> * Add permissions Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu> --------- Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>
Limit the p2p latency threshold Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>
TODO:
offload to cpu
offload to local disk