Add XPU support to LMCache for CPU offloading by zhenwei-intel · Pull Request #1 · zhenwei-intel/LMCache

zhenwei-intel · 2025-09-10T08:24:05Z

TODO:

offload to cpu

offload to local disk

zhenwei-intel · 2025-09-10T08:29:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for Intel XPUs, which is a significant enhancement for platform compatibility. The changes primarily involve adding a Platform utility class for hardware detection and making various parts of the codebase platform-agnostic by replacing CUDA-specific calls with conditional logic. A new VLLMPagedMemXPUConnectorV2 is also added.

The overall approach is solid, but I've found a few critical issues where the optional import of lmcache.c_ops can lead to runtime errors on non-CUDA platforms, as the code doesn't handle the case where the import fails. I've also included some suggestions to improve code maintainability by reducing duplication and cleaning up comments.

gemini-code-assist · 2025-09-10T08:40:13Z

+        """Expect a kwarg 'kvcaches' which is a nested tuple of K and V tensors.
+        The kvcaches should correspond to the "WHOLE token sequence".
+
+        Note:
+          1. This function expects the 'slot_mapping' is a "full slot mapping"
+             where it's length is the same as the whole token sequence.
+          2. In the case that there is prefix caching, slot_mapping will starts
+             with -1s until the end of the matched prefix. The start and end
+             should NEVER overlap with the prefix caching (which means the
+             underlying CUDA kernel will never see -1 in slot_mapping)
+
+
+        :raises ValueError: If 'kvcaches' is not provided in kwargs.
+        :raises AssertionError: If the memory object does not have a tensor.
+        :raises ValueError: If 'slot_mapping' is not provided in kwargs.
+        """


The docstring for to_gpu mentions "CUDA kernel" on line 121. Since this connector is for XPU, this is misleading. Please update the comment to be platform-agnostic. For example, you could replace "CUDA kernel" with "computation kernel" or "underlying kernel". The same issue exists in the docstring for from_gpu on line 177.

zhenwei-intel · 2025-09-10T12:34:06Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for Intel XPUs by adding a Platform utility to abstract device-specific logic. It includes a new VLLMPagedMemXPUConnectorV2 for XPU integration and updates various parts of the codebase to be device-agnostic. The changes are well-structured. However, there are critical issues where the optional import of CUDA-specific C operations (lmc_ops) is not handled correctly, which will lead to crashes on non-CUDA platforms like XPU when certain features (e.g., cachegen serde) are used. There are also some minor code cleanup opportunities in the new XPU connector.

zhenwei-intel · 2025-09-15T08:59:38Z

/gemini review

gemini-code-assist

Code Review

This pull request adds XPU support to LMCache, primarily by introducing a Platform utility for device detection and implementing a new VLLMPagedMemXPUConnectorV2 for XPU devices. The changes are well-structured to abstract away device-specific logic. However, I've found a critical issue with stream initialization for XPU that could lead to a runtime error. Additionally, there are opportunities to improve maintainability by reducing code duplication and to enhance clarity by correcting docstrings and error messages in the new XPU connector. Addressing these points will strengthen the XPU integration.

[bugfix] fix memory lead in fs connector bugfix Signed-off-by: idellzheng <idellzheng@tencent.com>

* add flash infer Signed-off-by: Samuel Shen <slshen@uchicago.edu> * missed one Signed-off-by: Samuel Shen <slshen@uchicago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchicago.edu> Co-authored-by: Samuel Shen <slshen@uchicago.edu>

zhenwei-intel · 2025-09-29T12:17:22Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for Intel XPUs to LMCache, which is a valuable addition for enabling CPU offloading on different hardware. The changes primarily involve abstracting device-specific code (CUDA vs. XPU) and adding a new VLLMPagedMemXPUConnectorV2 for XPU devices. The overall approach is solid, and the device-agnostic changes like replacing .cuda() with .to(device) are well-executed. My review focuses on improving maintainability by addressing code duplication, fixing minor documentation errors, and highlighting potential performance and correctness issues in the new XPU connector.

gemini-code-assist · 2025-09-29T12:19:38Z

+
+    # TODO(Jiayi): need to optimize to enable real batching
+    def batched_to_gpu(self, memory_objs, starts, ends, **kwargs):
+        for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False):


Using zip with strict=False can hide potential bugs if the input iterables (memory_objs, starts, ends) have mismatched lengths, as it will silently truncate to the shortest one. It is safer to use strict=True (available in Python 3.10+) to raise a ValueError in such cases. If compatibility with older Python versions is required, adding an explicit length check before the loop would be a robust alternative.

Suggested change

for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False):

for memory_obj, start, end in zip(memory_objs, starts, ends, strict=True):

gemini-code-assist · 2025-09-29T12:19:38Z

+    if torch.cuda.is_available():
+        torch_dev = torch.cuda
+        dev_name = "cuda"
+    elif torch.xpu.is_available():
+        torch_dev = torch.xpu
+        dev_name = "xpu"
+    else:
+        raise RuntimeError("Unsupported device platform for LMCache engine.")


The logic to detect the available device (cuda or xpu) is duplicated in this function, both here and at lines 536-541. This could be refactored into a helper function to improve maintainability and reduce code duplication. For example, a function could return the device module and name, which can then be used in both places.

gemini-code-assist · 2025-09-29T12:19:38Z

+logger = init_logger(__name__)
+
+
+class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2):


VLLMPagedMemXPUConnectorV2 inherits from VLLMPagedMemGPUConnectorV2, but it seems to override most of the core logic (__init__, to_gpu, from_gpu) and uses a different implementation strategy (pure PyTorch vs. custom CUDA kernels). The parent class also contains CUDA-specific code that makes direct inheritance for other devices problematic. This can lead to maintenance challenges.

A cleaner design might be for VLLMPagedMemXPUConnectorV2 to inherit directly from GPUConnectorInterface. Any logic that is truly common between the GPU and XPU connectors could be extracted into a shared base class or utility functions.

gemini-code-assist · 2025-09-29T12:19:38Z

+          2. In the case that there is prefix caching, slot_mapping will starts
+             with -1s until the end of the matched prefix. The start and end
+             should NEVER overlap with the prefix caching (which means the
+             underlying CUDA kernel will never see -1 in slot_mapping)


The docstring mentions a CUDA kernel, which appears to be a copy-paste artifact from the GPU connector. To maintain accuracy and clarity, this should be updated to be device-agnostic, for example, by simply referring to it as the underlying kernel.

Suggested change

underlying CUDA kernel will never see -1 in slot_mapping)

should NEVER overlap with the prefix caching (which means the

underlying kernel will never see -1 in slot_mapping)

gemini-code-assist · 2025-09-29T12:19:38Z

+            for i, kvcache in enumerate(kvcaches):
+                kvcache.view(total_blocks, head_size).index_copy_(0, slices, tmp[i])
+        else:
+            tmp_k = memory_obj.tensor[0].to(slot_mapping.device)
+            tmp_v = memory_obj.tensor[1].to(slot_mapping.device)
+            num_blocks, block_size, num_heads, head_size = kvcaches[0][0].shape
+            total_blocks = num_blocks * block_size
+            d = num_heads * head_size
+            for i, (kcache, vcache) in enumerate(kvcaches):
+                kcache.view(total_blocks, d).index_copy_(0, slices, tmp_k[i])
+                vcache.view(total_blocks, d).index_copy_(0, slices, tmp_v[i])


The explicit Python loop over layers to perform index_copy_ can introduce performance overhead, especially for models with a large number of layers. This is in contrast to the GPU connector which uses a single fused CUDA kernel for this operation. While this pure-PyTorch implementation is a great way to support XPU without custom kernels, it's worth noting this potential performance bottleneck. If performance becomes an issue, creating a batched version of this operation or a custom XPU kernel could be considered for optimization.

gemini-code-assist · 2025-09-29T12:19:38Z

+          2. In the case that there is prefix caching, slot_mapping will starts
+             with -1s until the end of the matched prefix. The start and end
+             should NEVER overlap with the prefix caching (which means the
+             underlying CUDA kernel will never see -1 in slot_mapping)


Similar to a previous comment, the docstring mentions a CUDA kernel. This should be updated to be device-agnostic to avoid confusion.

Suggested change

underlying CUDA kernel will never see -1 in slot_mapping)

should NEVER overlap with the prefix caching (which means the

underlying kernel will never see -1 in slot_mapping)

gemini-code-assist · 2025-09-29T12:19:38Z

+            tmp = torch.stack(
+                [
+                    kvcache.view(total_blocks, head_size).index_select(0, slices)
+                    for kvcache in kvcaches
+                ]
+            )
+        else:
+            num_blocks, block_size, num_heads, head_size = kvcaches[0][0].shape
+            total_blocks = num_blocks * block_size
+            d = num_heads * head_size
+            tmp_k = torch.stack(
+                [
+                    kvcache[0].view(total_blocks, d).index_select(0, slices)
+                    for kvcache in kvcaches
+                ]
+            )
+            tmp_v = torch.stack(
+                [
+                    kvcache[1].view(total_blocks, d).index_select(0, slices)
+                    for kvcache in kvcaches
+                ]
+            )
+            tmp = torch.stack([tmp_k, tmp_v])


Similar to the to_gpu method, the use of list comprehensions here to iterate over layers ([... for kvcache in kvcaches]) effectively creates a Python loop. This can be a performance bottleneck compared to a fused kernel, especially for models with many layers. It's a good idea to be aware of this potential performance impact.

* initial Signed-off-by: Samuel Shen <slshen@uchicago.edu> * finished documenationgit branch Signed-off-by: Samuel Shen <slshen@uchicago.edu> * change color Signed-off-by: Samuel Shen <slshen@uchicago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchicago.edu> Co-authored-by: Samuel Shen <slshen@uchicago.edu>

…Cache#1609) add --enforce-strict-concurrent-users option to allow user count enforcement in multi-round-qa.py

fix pin_count type Signed-off-by: idellzheng <idellzheng@tencent.com>

Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>

* fix lookup Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * fix format Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * fix bugs Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> * fix small bugs Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn> --------- Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

Signed-off-by: idellzheng <idellzheng@tencent.com>

…connector as first step (LMCache#1696) * Implement remove api for remote backend and fs_connector as first step Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Update lmcache/v1/storage_backend/connector/fs_connector.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> * Update lmcache/v1/storage_backend/remote_backend.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> * Update lmcache/v1/storage_backend/remote_backend.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> * Fix style Signed-off-by: baoloongmao <baoloongmao@tencent.com> * Add remove_sync method to InstrumentedRemoteConnector * Add remove_sync method for key audit logging --------- Signed-off-by: baoloongmao <baoloongmao@tencent.com> Signed-off-by: maobaolong <307499405@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Async redis cluster remote connector + unit tests Signed-off-by: Lindsey Wen <lindwen@amazon.com> Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu> * Run pre-commit Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu> --------- Signed-off-by: Lindsey Wen <lindwen@amazon.com> Signed-off-by: Lindsey Wen <lindseywen@ucsb.edu> Co-authored-by: Lindsey Wen <lindwen@amazon.com>

Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.6.2 to 5.0.0. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@ea165f8...330a01c) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: 5.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…1937) Follow up documentation / improvements to come. * [Core] Add SageMaker HyperPod remote connector Signed-off-by: Ziwen Ning <ningziwe@amazon.com> * Address comments Signed-off-by: Ziwen Ning <ningziwe@amazon.com> --------- Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

…okensInternalResult (LMCache#1953) Signed-off-by: baoloongmao <baoloongmao@tencent.com>

fix Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

* filter out 0 hit rate Signed-off-by: idellzheng <idellzheng@tencent.com> * rename Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>

* [refactor] unified reconstruct cache engine key Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * move method to CacheEngineKey Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

Support get the env of current process Signed-off-by: baoloongmao <baoloongmao@tencent.com>

Adapt hash func to vllm Signed-off-by: baoloongmao <baoloongmao@tencent.com>

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

LMCache#1827) * [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> * format code with ruff Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> * [bugfix]: only update skip_leading_tokens on last PP rank in wait_for_save Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> --------- Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> Co-authored-by: 董佩琪 <dong.peiqi@neolink.com>

…_address (LMCache#1971) LMCACHE_REMOTE_URL overrides master_server_address even when explicitly configured (mooncake HA mode issue) Signed-off-by: 董佩琪 <dong.peiqi@neolink.com> Co-authored-by: 董佩琪 <dong.peiqi@neolink.com>

…e#1965) * [bugfix] do not start lmcache worker on scheduler Signed-off-by: idellzheng <idellzheng@tencent.com> * Update lmcache/v1/cache_engine.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: maobaolong <307499405@qq.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com> Signed-off-by: maobaolong <307499405@qq.com> Co-authored-by: maobaolong <307499405@qq.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [controller] simplify config when p2p is not enabled Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * optimize Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>

* [Model] Added support for Qwen2 Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg> * [Refactor] Remove duplicating code files for Qwen2 support Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg> --------- Signed-off-by: Lee Yao Yang <leey0227@e.ntu.edu.sg>

Added office hours Signed-off-by: Nicolas (Nick) Barcet <nicolas@barcet.com>

…MCache#1655) feat(mooncake): add NUMA affinity and batched operations support Signed-off-by: Jinyang Su <751080330@qq.com> Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>

* [optimize] optimize connection of remote backend Signed-off-by: idellzheng <idellzheng@tencent.com> * adapt remote monitor Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * do not initialize storage manager Signed-off-by: idellzheng <idellzheng@tencent.com> * remove remote backend change Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * checkstyle fix Signed-off-by: idellzheng <idellzheng@tencent.com> * bugfix Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>

* Introduce a dynamic expend memory allocator Signed-off-by: baoloongmao <baoloongmao@tencent.com>

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

* Add automerge-labeler.yml Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu> * Add permissions Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu> --------- Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>

Limit the p2p latency threshold Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>

zhenwei-intel requested a review from Copilot September 10, 2025 08:36

This comment was marked as duplicate.

Sign in to view

gemini-code-assist Bot reviewed Sep 10, 2025

View reviewed changes

zhenwei-intel changed the title ~~Add xpu~~ Add XPU support to LMCache for CPU offloading Sep 11, 2025

zhenwei-intel force-pushed the add_xpu branch 3 times, most recently from 44807c9 to c65bb1f Compare September 15, 2025 07:49

gemini-code-assist Bot reviewed Sep 15, 2025

View reviewed changes

Comment thread lmcache/v1/storage_backend/storage_manager.py Outdated

Comment thread lmcache/v1/xpu_connector.py

Comment thread lmcache/v1/xpu_connector.py

Comment thread lmcache/v1/xpu_connector.py

Comment thread lmcache/v1/xpu_connector.py

zhenwei-intel force-pushed the add_xpu branch 4 times, most recently from a1f5f64 to 86ddb33 Compare September 15, 2025 12:03

chunxiaozheng and others added 2 commits September 29, 2025 10:19

[bugfix] fix memory leak in fs connector (LMCache#1656)

217f996

[bugfix] fix memory lead in fs connector bugfix Signed-off-by: idellzheng <idellzheng@tencent.com>

[CD]: add flash infer (LMCache#1699)

c46e7e8

* add flash infer Signed-off-by: Samuel Shen <slshen@uchicago.edu> * missed one Signed-off-by: Samuel Shen <slshen@uchicago.edu> --------- Signed-off-by: Samuel Shen <slshen@uchicago.edu> Co-authored-by: Samuel Shen <slshen@uchicago.edu>

zhenwei-intel force-pushed the add_xpu branch from 86ddb33 to e373da8 Compare September 29, 2025 12:08

gemini-code-assist Bot reviewed Sep 29, 2025

View reviewed changes

sammshen and others added 9 commits September 29, 2025 09:29

[Benchmark] configure user stacking at high qps in multi-round-qa (LM…

a2c5ba5

…Cache#1609) add --enforce-strict-concurrent-users option to allow user count enforcement in multi-round-qa.py

[minor fix]fix pin_count type (LMCache#1685)

89ce712

fix pin_count type Signed-off-by: idellzheng <idellzheng@tencent.com>

[Docs] Add minimal runnable quick start guide (LMCache#1725)

7afee73

Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>

[Benchmark] Emphasize query round results in long_doc_qa (LMCache#1730)

4e5d036

Signed-off-by: Kobe Chen <xiaokunchen0@gmail.com>

[minor fix] use parent __init__ to init meta (LMCache#1716)

08ccb7d

Signed-off-by: idellzheng <idellzheng@tencent.com>

dependabot Bot and others added 12 commits November 5, 2025 12:30

refactor: Use type alias instead of tuple for ProcessedChunk/ProcessT…

f4c0e2a

…okensInternalResult (LMCache#1953) Signed-off-by: baoloongmao <baoloongmao@tencent.com>

[Bugfix] Fix key comparison (LMCache#1954)

fe3ca1b

fix Signed-off-by: YaoJiayi <120040070@link.cuhk.edu.cn>

[metrics]filter out 0 hit rate (LMCache#1921)

6b7e3e9

* filter out 0 hit rate Signed-off-by: idellzheng <idellzheng@tencent.com> * rename Signed-off-by: idellzheng <idellzheng@tencent.com> --------- Signed-off-by: idellzheng <idellzheng@tencent.com>

[CI/Build] Add /opt/venv/bin to PATH (LMCache#1925)

df9dcd0

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

[logging]: less verbose async logs (LMCache#1820)

ca1cb9d

Signed-off-by: Samuel Shen <slshen@uchciago.edu> Co-authored-by: Samuel Shen <slshen@uchciago.edu>

Add XPU support to LMCache for CPU/disk offloading

95862fb

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

update device and log info

9934b86

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

remove _lmcache_nvtx_annotate

2e72c1e

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

using vllm current platform

866d32b

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

zhenwei-intel force-pushed the add_xpu branch from 32f673f to 866d32b Compare November 7, 2025 01:55

maobaolong and others added 16 commits November 7, 2025 18:33

[feat] Support get the env of current process (LMCache#1944)

23bddb9

Support get the env of current process Signed-off-by: baoloongmao <baoloongmao@tencent.com>

Adapt hash func to vllm recent changes (LMCache#1952)

a64ca32

Adapt hash func to vllm Signed-off-by: baoloongmao <baoloongmao@tencent.com>

[CI/Build] Move /opt/venv/bin to PATH to BASE image (LMCache#1962)

59102a5

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

Update meetings.rst (LMCache#1994)

ab85309

Added office hours Signed-off-by: Nicolas (Nick) Barcet <nicolas@barcet.com>

[Core][Mooncake]: add NUMA affinity and batched operations support (L…

e01afb3

…MCache#1655) feat(mooncake): add NUMA affinity and batched operations support Signed-off-by: Jinyang Su <751080330@qq.com> Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>

Introduce a dynamic expend memory allocator (LMCache#1899)

cd04aa0

* Introduce a dynamic expend memory allocator Signed-off-by: baoloongmao <baoloongmao@tencent.com>

[fix] Fix HyperPod connector release lease API (LMCache#1968)

dfa0503

Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

[CI/Build] Support for required comprehensive test (LMCache#2001)

8ab94fd

* Add automerge-labeler.yml Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu> * Add permissions Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu> --------- Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>

[CI/Build] Limit the p2p latency threshold (LMCache#2003)

6f7327a

Limit the p2p latency threshold Signed-off-by: Shaoting-Feng <shaotingf@uchicago.edu>

Merge branch 'dev' into add_xpu

2519583

zhenwei-intel closed this Dec 12, 2025

	for memory_obj, start, end in zip(memory_objs, starts, ends, strict=False):
	for memory_obj, start, end in zip(memory_objs, starts, ends, strict=True):

		logger = init_logger(__name__)


		class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2):

	underlying CUDA kernel will never see -1 in slot_mapping)
	should NEVER overlap with the prefix caching (which means the
	underlying kernel will never see -1 in slot_mapping)

Conversation

zhenwei-intel commented Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

offload to cpu

offload to local disk

Uh oh!

zhenwei-intel commented Sep 10, 2025

Uh oh!

This comment was marked as duplicate.

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zhenwei-intel commented Sep 10, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhenwei-intel commented Sep 15, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhenwei-intel commented Sep 29, 2025

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

zhenwei-intel commented Sep 10, 2025 •

edited

Loading