Skip to content

Add XPU support to LMCache for CPU offloading#1

Closed
zhenwei-intel wants to merge 133 commits into
devfrom
add_xpu
Closed

Add XPU support to LMCache for CPU offloading#1
zhenwei-intel wants to merge 133 commits into
devfrom
add_xpu

Conversation

@zhenwei-intel

@zhenwei-intel zhenwei-intel commented Sep 10, 2025

Copy link
Copy Markdown
Owner

TODO:

  • add test
  • mla check
  • rebase dev
  • disk offload
  • optimize from_gpu and to_gpu
image image

offload to cpu

image

offload to local disk

image image

@zhenwei-intel

Copy link
Copy Markdown
Owner Author

/gemini review

This comment was marked as duplicate.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Intel XPUs, which is a significant enhancement for platform compatibility. The changes primarily involve adding a Platform utility class for hardware detection and making various parts of the codebase platform-agnostic by replacing CUDA-specific calls with conditional logic. A new VLLMPagedMemXPUConnectorV2 is also added.

The overall approach is solid, but I've found a few critical issues where the optional import of lmcache.c_ops can lead to runtime errors on non-CUDA platforms, as the code doesn't handle the case where the import fails. I've also included some suggestions to improve code maintainability by reducing duplication and cleaning up comments.

Comment thread lmcache/storage_backend/serde/cachegen_decoder.py Outdated
Comment thread lmcache/storage_backend/serde/cachegen_encoder.py Outdated
Comment thread lmcache/v1/compute/positional_encoding.py Outdated
Comment thread lmcache/usage_context.py Outdated
Comment on lines +112 to +127
"""Expect a kwarg 'kvcaches' which is a nested tuple of K and V tensors.
The kvcaches should correspond to the "WHOLE token sequence".

Note:
1. This function expects the 'slot_mapping' is a "full slot mapping"
where it's length is the same as the whole token sequence.
2. In the case that there is prefix caching, slot_mapping will starts
with -1s until the end of the matched prefix. The start and end
should NEVER overlap with the prefix caching (which means the
underlying CUDA kernel will never see -1 in slot_mapping)


:raises ValueError: If 'kvcaches' is not provided in kwargs.
:raises AssertionError: If the memory object does not have a tensor.
:raises ValueError: If 'slot_mapping' is not provided in kwargs.
"""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring for to_gpu mentions "CUDA kernel" on line 121. Since this connector is for XPU, this is misleading. Please update the comment to be platform-agnostic. For example, you could replace "CUDA kernel" with "computation kernel" or "underlying kernel". The same issue exists in the docstring for from_gpu on line 177.

Comment thread lmcache/v1/xpu_connector.py Outdated
@zhenwei-intel

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Intel XPUs by adding a Platform utility to abstract device-specific logic. It includes a new VLLMPagedMemXPUConnectorV2 for XPU integration and updates various parts of the codebase to be device-agnostic. The changes are well-structured. However, there are critical issues where the optional import of CUDA-specific C operations (lmc_ops) is not handled correctly, which will lead to crashes on non-CUDA platforms like XPU when certain features (e.g., cachegen serde) are used. There are also some minor code cleanup opportunities in the new XPU connector.

Comment thread lmcache/storage_backend/serde/cachegen_decoder.py Outdated
Comment thread lmcache/storage_backend/serde/cachegen_encoder.py Outdated
Comment thread lmcache/v1/compute/positional_encoding.py Outdated
Comment thread lmcache/v1/xpu_connector.py Outdated
Comment thread lmcache/v1/xpu_connector.py Outdated
@zhenwei-intel zhenwei-intel changed the title Add xpu Add XPU support to LMCache for CPU offloading Sep 11, 2025
@zhenwei-intel zhenwei-intel force-pushed the add_xpu branch 3 times, most recently from 44807c9 to c65bb1f Compare September 15, 2025 07:49
@zhenwei-intel

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds XPU support to LMCache, primarily by introducing a Platform utility for device detection and implementing a new VLLMPagedMemXPUConnectorV2 for XPU devices. The changes are well-structured to abstract away device-specific logic. However, I've found a critical issue with stream initialization for XPU that could lead to a runtime error. Additionally, there are opportunities to improve maintainability by reducing code duplication and to enhance clarity by correcting docstrings and error messages in the new XPU connector. Addressing these points will strengthen the XPU integration.

Comment thread lmcache/v1/storage_backend/storage_manager.py Outdated
Comment thread lmcache/v1/xpu_connector.py
Comment thread lmcache/v1/xpu_connector.py
Comment thread lmcache/v1/xpu_connector.py
Comment thread lmcache/v1/xpu_connector.py
@zhenwei-intel zhenwei-intel force-pushed the add_xpu branch 4 times, most recently from a1f5f64 to 86ddb33 Compare September 15, 2025 12:03
chunxiaozheng and others added 2 commits September 29, 2025 10:19
[bugfix] fix memory lead in fs connector


bugfix

Signed-off-by: idellzheng <idellzheng@tencent.com>
* add flash infer

Signed-off-by: Samuel Shen <slshen@uchicago.edu>

* missed one

Signed-off-by: Samuel Shen <slshen@uchicago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Co-authored-by: Samuel Shen <slshen@uchicago.edu>
@zhenwei-intel

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Intel XPUs to LMCache, which is a valuable addition for enabling CPU offloading on different hardware. The changes primarily involve abstracting device-specific code (CUDA vs. XPU) and adding a new VLLMPagedMemXPUConnectorV2 for XPU devices. The overall approach is solid, and the device-agnostic changes like replacing .cuda() with .to(device) are well-executed. My review focuses on improving maintainability by addressing code duplication, fixing minor documentation errors, and highlighting potential performance and correctness issues in the new XPU connector.


# TODO(Jiayi): need to optimize to enable real batching
def batched_to_gpu(self, memory_objs, starts, ends, **kwargs):
for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using zip with strict=False can hide potential bugs if the input iterables (memory_objs, starts, ends) have mismatched lengths, as it will silently truncate to the shortest one. It is safer to use strict=True (available in Python 3.10+) to raise a ValueError in such cases. If compatibility with older Python versions is required, adding an explicit length check before the loop would be a robust alternative.

Suggested change
for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False):
for memory_obj, start, end in zip(memory_objs, starts, ends, strict=True):

Comment on lines +479 to +486
if torch.cuda.is_available():
torch_dev = torch.cuda
dev_name = "cuda"
elif torch.xpu.is_available():
torch_dev = torch.xpu
dev_name = "xpu"
else:
raise RuntimeError("Unsupported device platform for LMCache engine.")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to detect the available device (cuda or xpu) is duplicated in this function, both here and at lines 536-541. This could be refactored into a helper function to improve maintainability and reduce code duplication. For example, a function could return the device module and name, which can then be used in both places.

logger = init_logger(__name__)


class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

VLLMPagedMemXPUConnectorV2 inherits from VLLMPagedMemGPUConnectorV2, but it seems to override most of the core logic (__init__, to_gpu, from_gpu) and uses a different implementation strategy (pure PyTorch vs. custom CUDA kernels). The parent class also contains CUDA-specific code that makes direct inheritance for other devices problematic. This can lead to maintenance challenges.

A cleaner design might be for VLLMPagedMemXPUConnectorV2 to inherit directly from GPUConnectorInterface. Any logic that is truly common between the GPU and XPU connectors could be extracted into a shared base class or utility functions.

2. In the case that there is prefix caching, slot_mapping will starts
with -1s until the end of the matched prefix. The start and end
should NEVER overlap with the prefix caching (which means the
underlying CUDA kernel will never see -1 in slot_mapping)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring mentions a CUDA kernel, which appears to be a copy-paste artifact from the GPU connector. To maintain accuracy and clarity, this should be updated to be device-agnostic, for example, by simply referring to it as the underlying kernel.

Suggested change
underlying CUDA kernel will never see -1 in slot_mapping)
should NEVER overlap with the prefix caching (which means the
underlying kernel will never see -1 in slot_mapping)

Comment on lines +127 to +137
for i, kvcache in enumerate(kvcaches):
kvcache.view(total_blocks, head_size).index_copy_(0, slices, tmp[i])
else:
tmp_k = memory_obj.tensor[0].to(slot_mapping.device)
tmp_v = memory_obj.tensor[1].to(slot_mapping.device)
num_blocks, block_size, num_heads, head_size = kvcaches[0][0].shape
total_blocks = num_blocks * block_size
d = num_heads * head_size
for i, (kcache, vcache) in enumerate(kvcaches):
kcache.view(total_blocks, d).index_copy_(0, slices, tmp_k[i])
vcache.view(total_blocks, d).index_copy_(0, slices, tmp_v[i])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The explicit Python loop over layers to perform index_copy_ can introduce performance overhead, especially for models with a large number of layers. This is in contrast to the GPU connector which uses a single fused CUDA kernel for this operation. While this pure-PyTorch implementation is a great way to support XPU without custom kernels, it's worth noting this potential performance bottleneck. If performance becomes an issue, creating a batched version of this operation or a custom XPU kernel could be considered for optimization.

2. In the case that there is prefix caching, slot_mapping will starts
with -1s until the end of the matched prefix. The start and end
should NEVER overlap with the prefix caching (which means the
underlying CUDA kernel will never see -1 in slot_mapping)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to a previous comment, the docstring mentions a CUDA kernel. This should be updated to be device-agnostic to avoid confusion.

Suggested change
underlying CUDA kernel will never see -1 in slot_mapping)
should NEVER overlap with the prefix caching (which means the
underlying kernel will never see -1 in slot_mapping)

Comment on lines +173 to +195
tmp = torch.stack(
[
kvcache.view(total_blocks, head_size).index_select(0, slices)
for kvcache in kvcaches
]
)
else:
num_blocks, block_size, num_heads, head_size = kvcaches[0][0].shape
total_blocks = num_blocks * block_size
d = num_heads * head_size
tmp_k = torch.stack(
[
kvcache[0].view(total_blocks, d).index_select(0, slices)
for kvcache in kvcaches
]
)
tmp_v = torch.stack(
[
kvcache[1].view(total_blocks, d).index_select(0, slices)
for kvcache in kvcaches
]
)
tmp = torch.stack([tmp_k, tmp_v])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the to_gpu method, the use of list comprehensions here to iterate over layers ([... for kvcache in kvcaches]) effectively creates a Python loop. This can be a performance bottleneck compared to a fused kernel, especially for models with many layers. It's a good idea to be aware of this potential performance impact.

sammshen and others added 9 commits September 29, 2025 09:29
* initial

Signed-off-by: Samuel Shen <slshen@uchicago.edu>

* finished documenationgit branch

Signed-off-by: Samuel Shen <slshen@uchicago.edu>

* change color

Signed-off-by: Samuel Shen <slshen@uchicago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Co-authored-by: Samuel Shen <slshen@uchicago.edu>
…Cache#1609)

add --enforce-strict-concurrent-users option to allow user count enforcement in multi-round-qa.py
fix pin_count type

Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>
Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>
* fix lookup

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

* fix format

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

* fix bugs

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

* fix small bugs

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

---------

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
Signed-off-by: idellzheng <idellzheng@tencent.com>
…connector as first step (LMCache#1696)

* Implement remove api for remote backend and fs_connector as first step

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Update lmcache/v1/storage_backend/connector/fs_connector.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: maobaolong <307499405@qq.com>

* Update lmcache/v1/storage_backend/remote_backend.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: maobaolong <307499405@qq.com>

* Update lmcache/v1/storage_backend/remote_backend.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: maobaolong <307499405@qq.com>

* Fix style

Signed-off-by: baoloongmao <baoloongmao@tencent.com>

* Add remove_sync method to InstrumentedRemoteConnector

* Add remove_sync method for key audit logging

---------

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: maobaolong <307499405@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Async redis cluster remote connector + unit tests

Signed-off-by: Lindsey Wen <lindwen@amazon.com>
Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu>

* Run pre-commit

Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu>

---------

Signed-off-by: Lindsey Wen <lindwen@amazon.com>
Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu>
Co-authored-by: Lindsey Wen <lindwen@amazon.com>
dependabot Bot and others added 12 commits November 5, 2025 12:30
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 5.0.0.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@ea165f8...330a01c)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: 5.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…1937)

Follow up documentation / improvements to come.

* [Core] Add SageMaker HyperPod remote connector

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

* Address comments

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

---------

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
…okensInternalResult (LMCache#1953)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
fix

Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>
* filter out 0 hit rate

Signed-off-by: idellzheng <idellzheng@tencent.com>

* rename

Signed-off-by: idellzheng <idellzheng@tencent.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
* [refactor] unified reconstruct cache engine key

Signed-off-by: idellzheng <idellzheng@tencent.com>

* checkstyle fix

Signed-off-by: idellzheng <idellzheng@tencent.com>

* move method to CacheEngineKey

Signed-off-by: idellzheng <idellzheng@tencent.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
maobaolong and others added 16 commits November 7, 2025 18:33
Support get the env of current process

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Adapt hash func to vllm

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
LMCache#1827)

* [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save

[bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save

Signed-off-by: 董佩琪 <dong.peiqi@neolink.com>

* format code with ruff

Signed-off-by: 董佩琪 <dong.peiqi@neolink.com>

* [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save

Signed-off-by: 董佩琪 <dong.peiqi@neolink.com>

---------

Signed-off-by: 董佩琪 <dong.peiqi@neolink.com>
Co-authored-by: 董佩琪 <dong.peiqi@neolink.com>
…_address (LMCache#1971)

LMCACHE_REMOTE_URL overrides master_server_address even when explicitly configured (mooncake HA mode issue)

Signed-off-by: 董佩琪 <dong.peiqi@neolink.com>
Co-authored-by: 董佩琪 <dong.peiqi@neolink.com>
…e#1965)

* [bugfix] do not start lmcache worker on scheduler

Signed-off-by: idellzheng <idellzheng@tencent.com>

* Update lmcache/v1/cache_engine.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: maobaolong <307499405@qq.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
Signed-off-by: maobaolong <307499405@qq.com>
Co-authored-by: maobaolong <307499405@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [controller] simplify config when p2p is not enabled

Signed-off-by: idellzheng <idellzheng@tencent.com>

* checkstyle fix

Signed-off-by: idellzheng <idellzheng@tencent.com>

* optimize

Signed-off-by: idellzheng <idellzheng@tencent.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
* [Model] Added support for Qwen2

Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg>

* [Refactor] Remove duplicating code files for Qwen2 support

Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg>

---------

Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg>
Added office hours

Signed-off-by: Nicolas (Nick) Barcet <nicolas@barcet.com>
…MCache#1655)

feat(mooncake): add NUMA affinity and batched operations support

Signed-off-by: Jinyang Su <751080330@qq.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
* [optimize] optimize connection of remote backend

Signed-off-by: idellzheng <idellzheng@tencent.com>

* adapt remote monitor

Signed-off-by: idellzheng <idellzheng@tencent.com>

* checkstyle fix

Signed-off-by: idellzheng <idellzheng@tencent.com>

* do not initialize storage manager

Signed-off-by: idellzheng <idellzheng@tencent.com>

* remove remote backend change

Signed-off-by: idellzheng <idellzheng@tencent.com>

* checkstyle fix

Signed-off-by: idellzheng <idellzheng@tencent.com>

* checkstyle fix

Signed-off-by: idellzheng <idellzheng@tencent.com>

* bugfix

Signed-off-by: idellzheng <idellzheng@tencent.com>

---------

Signed-off-by: idellzheng <idellzheng@tencent.com>
* Introduce a dynamic expend memory allocator
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
* Add automerge-labeler.yml

Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>

* Add permissions

Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>

---------

Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>
Limit the p2p latency threshold

Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.