-
- PR Checklist (Click to Expand)
-
-Thank you for your contribution to LMCache! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.
-
-PR Title and Classification
-Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
-
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Core] for changes in the core LMCache logic (e.g., LMCacheEngine, Backend etc.)
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.
-
-Note: If the PR spans more than one category, please include all relevant prefixes.
-
-Code Quality
-
-The PR need to meet the following code quality standards:
-
-
- - The code need to be well-documented to ensure future contributors can easily understand the code.
- - Please include sufficient unit tests to ensure the change is stay correct and robust. The unit and integration tests will always run and our comprehensive test will be triggered after the "full" label is tagged onto a PR.
-
-
-What to Expect for the Reviews
-
-We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of KuntaiDu, ApostaC or YaoJiayi.
-
-
+- [ ] this PR contains user-facing changes - docs added
+- [ ] this PR contains unit tests
diff --git a/.github/workflows/automerge-labeler.yml b/.github/workflows/automerge-labeler.yml
new file mode 100644
index 00000000000..d3b49a55784
--- /dev/null
+++ b/.github/workflows/automerge-labeler.yml
@@ -0,0 +1,17 @@
+name: Label auto-merge PRs
+
+on:
+ pull_request_target:
+ types: [ auto_merge_enabled, auto_merge_disabled ]
+
+permissions:
+ pull-requests: write
+
+jobs:
+ add_remove_labels:
+ runs-on: ubuntu-latest
+ steps:
+ - uses: ubuntudroid/automerge-labeler@v1
+ with:
+ token: ${{ secrets.GITHUB_TOKEN }}
+ label: 'full'
diff --git a/.github/workflows/build_doc.yml b/.github/workflows/build_doc.yml
index d66af4ed803..ef4b1f44981 100644
--- a/.github/workflows/build_doc.yml
+++ b/.github/workflows/build_doc.yml
@@ -53,7 +53,7 @@ jobs:
rm -rf output/dev
- name: Upload doc artifacts to GHA
- uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
+ uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
with:
name: doc-artifacts
path: output/
@@ -69,7 +69,7 @@ jobs:
egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
- name: Fetch doc artifacts
- uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0
+ uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
with:
name: doc-artifacts
path: output
diff --git a/.github/workflows/code_quality_checks.yml b/.github/workflows/code_quality_checks.yml
index 16cb6c0a17c..f1df04c88cc 100644
--- a/.github/workflows/code_quality_checks.yml
+++ b/.github/workflows/code_quality_checks.yml
@@ -1,6 +1,7 @@
name: Code Quality
on:
+ workflow_call:
pull_request:
push:
branches: [dev]
diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml
index ed0d7485b38..ccbf598b8ba 100644
--- a/.github/workflows/codeql.yml
+++ b/.github/workflows/codeql.yml
@@ -103,7 +103,7 @@ jobs:
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
- uses: github/codeql-action/init@181d5eefc20863364f96762470ba6f862bdef56b # v3.29.2
+ uses: github/codeql-action/init@f443b600d91635bebf5b0d9ebc620189c0d6fba5 # v4.30.8
with:
languages: ${{ matrix.language }}
build-mode: ${{ matrix.build-mode }}
@@ -131,6 +131,6 @@ jobs:
exit 1
- name: Perform CodeQL Analysis
- uses: github/codeql-action/analyze@181d5eefc20863364f96762470ba6f862bdef56b # v3.29.2
+ uses: github/codeql-action/analyze@f443b600d91635bebf5b0d9ebc620189c0d6fba5 # v4.30.8
with:
category: "/language:${{matrix.language}}"
diff --git a/.github/workflows/nightly_build.yml b/.github/workflows/nightly_build.yml
index 06912e874ab..51549598e2e 100644
--- a/.github/workflows/nightly_build.yml
+++ b/.github/workflows/nightly_build.yml
@@ -35,7 +35,7 @@ jobs:
astral.sh:443
- name: Login to DockerHub
- uses: docker/login-action@184bdaa0721073962dff0199f1fb9940f07167d1 # v3.5.0
+ uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3.6.0
with:
username: ${{ vars.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
index 7ea008366f7..cec8ca9a576 100644
--- a/.github/workflows/publish.yml
+++ b/.github/workflows/publish.yml
@@ -90,11 +90,20 @@ jobs:
python -m cibuildwheel --output-dir dist
- name: Upload release artifacts to GHA
- uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
+ uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
with:
name: release-artifacts
path: dist/
+ # Run tests and code quality checks before publishing
+ test:
+ name: Run tests
+ uses: ./.github/workflows/test.yml
+
+ code-quality:
+ name: Run code quality checks
+ uses: ./.github/workflows/code_quality_checks.yml
+
# Push to Test PyPI when:
# - a new GitHub release is published
# - a PR is merged into dev branch (push only trigger)
@@ -110,7 +119,7 @@ jobs:
# see https://docs.pypi.org/trusted-publishers/
id-token: write
runs-on: ubuntu-latest
- needs: build-artifacts
+ needs: [build-artifacts, test, code-quality]
steps:
- name: Harden Runner
@@ -127,7 +136,7 @@ jobs:
rekor.sigstore.dev:443
- name: Fetch release artifacts
- uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0
+ uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
with:
name: release-artifacts
path: dist
@@ -151,7 +160,7 @@ jobs:
contents: write
runs-on: ubuntu-latest
- needs: build-artifacts
+ needs: [build-artifacts, test, code-quality]
steps:
- name: Harden Runner
@@ -170,7 +179,7 @@ jobs:
rekor.sigstore.dev:443
- name: Fetch release artifacts
- uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0
+ uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0
with:
name: release-artifacts
path: dist
@@ -220,7 +229,7 @@ jobs:
layers.nvcr.io:443
- name: Login to DockerHub
- uses: docker/login-action@184bdaa0721073962dff0199f1fb9940f07167d1 # v3.5.0
+ uses: docker/login-action@5e57cd118135c172c3672efd75eb46360885c0ef # v3.6.0
with:
username: ${{ vars.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
@@ -258,7 +267,7 @@ jobs:
run: |
docker build \
--tag lmcache/vllm-openai:lightweight --tag lmcache/vllm-openai:${{ env.LATEST_TAG }}-lightweight \
- --file docker/Dockerfile.lightweight
+ --file docker/Dockerfile.lightweight .
- name: Push lmcache/vllm-openai:lightweight image to DockerHub
run: |
diff --git a/.github/workflows/scorecard.yml b/.github/workflows/scorecard.yml
index c2cfd7eb61a..a1477e73c90 100644
--- a/.github/workflows/scorecard.yml
+++ b/.github/workflows/scorecard.yml
@@ -58,7 +58,7 @@ jobs:
persist-credentials: false
- name: "Run analysis"
- uses: ossf/scorecard-action@05b42c624433fc40578a4040d5cf5e36ddca8cde # v2.4.2
+ uses: ossf/scorecard-action@4eaacf0543bb3f2c246792bd56e8cdeffafb205a # v2.4.3
with:
results_file: results.sarif
results_format: sarif
@@ -83,7 +83,7 @@ jobs:
# Upload the results as artifacts (optional). Commenting out will disable uploads of run results in SARIF
# format to the repository Actions tab.
- name: "Upload artifact"
- uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
+ uses: actions/upload-artifact@330a01c490aca151604b8cf639adc76d48f6c5d4 # v5.0.0
with:
name: SARIF file
path: results.sarif
@@ -92,6 +92,6 @@ jobs:
# Upload the results to GitHub's code scanning dashboard (optional).
# Commenting out will disable upload of results to your repo's Code Scanning dashboard
- name: "Upload to code-scanning"
- uses: github/codeql-action/upload-sarif@v3
+ uses: github/codeql-action/upload-sarif@v4
with:
sarif_file: results.sarif
diff --git a/.github/workflows/stale_bot.yml b/.github/workflows/stale_bot.yml
index 2afdb65c9eb..70671f05294 100644
--- a/.github/workflows/stale_bot.yml
+++ b/.github/workflows/stale_bot.yml
@@ -30,7 +30,7 @@ jobs:
api.github.com:443
- name: "Stale Action"
- uses: actions/stale@3a9db7e6a41a89f618792c92c0e97cc736e1b13f # v10.0.0
+ uses: actions/stale@5f858e3efba33a5ca4407a664cc011ad407f2008 # v10.1.0
with:
stale-issue-label: 'stale'
stale-issue-message: >
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
index a62a0a04004..22c2617dfd7 100644
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -1,6 +1,7 @@
name: Test
on:
+ workflow_call:
workflow_dispatch:
push:
branches:
@@ -38,8 +39,6 @@ jobs:
strategy:
matrix:
python:
- # Disable 3.9 until code supports it in https://github.com/LMCache/LMCache/pull/1584
- # - "3.9"
- "3.10"
- "3.11"
- "3.12"
@@ -49,7 +48,7 @@ jobs:
steps:
- name: "Harden Runner"
- uses: step-security/harden-runner@ec9f2d5744a09debf3a187a3f4f675c53b671911 # v2.13.0
+ uses: step-security/harden-runner@f4a75cfd619ee5ce8d5b864b0d183aff3c69b55a # v2.13.1
with:
egress-policy: audit # TODO: change to 'egress-policy: block' after couple of runs
@@ -63,7 +62,7 @@ jobs:
uses: ./.github/actions/free-disk-space
- name: Setup Python ${{ matrix.python }}
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
+ uses: actions/setup-python@e797f83bcb11b83ae66e0230d6156d7c80228e7c # v6.0.0
with:
python-version: ${{ matrix.python }}
cache: pip
@@ -74,11 +73,11 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
+ python -m pip install vllm
python -m pip install -r requirements/test.txt
python -m pip install -r requirements/common.txt
- python -m pip install torch==2.7.1 torchaudio==2.7.1 torchvision==0.22.1
- - name: "Run non-CUDA unit tests (v1/storage_backend)"
+ - name: "Run non-CUDA unit tests"
run: |
- pytest tests/v1/storage_backend/
+ pytest --ignore=tests/disagg --ignore=tests/v1/test_nixl_storage.py --ignore=tests/v1/multiprocess/test_cache_server.py
diff --git a/.gitignore b/.gitignore
index 85ee11dd2b1..e6d10e57f8e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -79,6 +79,9 @@ lmcache/experimental/tests
/examples/offline_inference/buggy_example.py
/examples/test_example
+# benchmark results
+*.csv
+
# disk cache
/remote_disk
/local_disk
diff --git a/README.md b/README.md
index 77714598c2f..64384847672 100644
--- a/README.md
+++ b/README.md
@@ -26,7 +26,7 @@
| [**Blog**](https://blog.lmcache.ai/)
| [**Documentation**](https://docs.lmcache.ai/)
-| [**Join Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-3bgx768yd-H8WkOTmPtbxVYJ5nuZ4dmA)
+| [**Join Slack**](https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-3g8e6xzz8-KzS_HI8bPERGFK5PTB~MYg)
| [**Interest Form**](https://forms.gle/MHwLiYDU6kcW3dLj7)
| [**Roadmap**](https://github.com/LMCache/LMCache/issues/1253)
@@ -47,6 +47,7 @@ By combining LMCache with vLLM, developers achieve 3-10x delay savings and GPU c
* High performance CPU KVCache offloading
* Disaggregated prefill
* P2P KVCache sharing
+- [x] Integration with SGLang for KV cache offloading
- [x] LMCache is supported in the [vLLM production stack](https://github.com/vllm-project/production-stack/), [llm-d](https://github.com/llm-d/llm-d/), and [KServe](https://github.com/kserve/kserve)
- [x] Stable support for non-prefix KV caches
- [x] Storage support as follows:
@@ -131,6 +132,13 @@ If you use LMCache for your research, please cite our papers:
booktitle = {Proceedings of the Twentieth European Conference on Computer Systems},
pages = {94–109},
}
+
+@article{cheng2025lmcache,
+ title={LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference},
+ author={Cheng, Yihua and Liu, Yuhan and Yao, Jiayi and An, Yuwei and Chen, Xiaokun and Feng, Shaoting and Huang, Yuyang and Shen, Samuel and Du, Kuntai and Jiang, Junchen},
+ journal={arXiv preprint arXiv:2510.09665},
+ year={2025}
+}
```
## Socials
diff --git a/benchmarks/long_doc_qa/long_doc_qa.py b/benchmarks/long_doc_qa/long_doc_qa.py
index 9757ab8ad1c..2de5e2cc820 100644
--- a/benchmarks/long_doc_qa/long_doc_qa.py
+++ b/benchmarks/long_doc_qa/long_doc_qa.py
@@ -487,12 +487,10 @@ async def main(args):
query_mean_ttft = benchmark_df["ttft"].mean()
CSI = "\x1b["
RESET = CSI + "0m"
+ print(f"Warmup round mean TTFT: {warmup_mean_ttft:.3f}s")
+ print(f"Warmup round time: {warmup_end_time - warmup_start_time:.3f}s")
+ print(f"Warmup round prompt count: {len(warmup_df)}")
print(f"{CSI}36;1m\n=== BENCHMARK RESULTS ==={RESET}")
- print(f"{CSI}32mWarmup round mean TTFT: {warmup_mean_ttft:.3f}s{RESET}")
- print(
- f"{CSI}33mWarmup round time: {warmup_end_time - warmup_start_time:.3f}s{RESET}"
- )
- print(f"{CSI}35mWarmup round prompt count: {len(warmup_df)}{RESET}")
print(f"{CSI}32mQuery round mean TTFT: {query_mean_ttft:.3f}s{RESET}")
print(
f"{CSI}33mQuery round time: "
diff --git a/benchmarks/long_doc_qa/long_doc_qa_recommender.py b/benchmarks/long_doc_qa/long_doc_qa_recommender.py
index a900e59e217..568517a2caf 100644
--- a/benchmarks/long_doc_qa/long_doc_qa_recommender.py
+++ b/benchmarks/long_doc_qa/long_doc_qa_recommender.py
@@ -42,9 +42,7 @@ def get_tensor_parallel_recommendation(model_name: str):
usable_per_gpu_memory = (
per_gpu_memory * 0.9 - intermediate_buffer - minimum_kv_cache_buffer
)
- print(
- "Estimated usable gpu memory for model weights per gpu: {usable_per_gpu_memory}"
- )
+ print(f"Usable gpu memory for model weights per gpu: {usable_per_gpu_memory}")
initial_tp = math.ceil(total_model_weights_gb / usable_per_gpu_memory)
# round up to a power of 2
return 2 ** math.ceil(math.log2(initial_tp))
@@ -158,6 +156,7 @@ def main(model_name: str):
f"but {model_name} requires {tp} tensor parallelism to run on your hardware"
)
return
+ print("This will take a while...")
per_gpu_kv_cache_GiB, tokens_in_prefix_cache = get_prefix_cache_token_size(
model_name, tp
)
@@ -186,9 +185,7 @@ def main(model_name: str):
def build_argument_parser():
parser = argparse.ArgumentParser()
- parser.add_argument(
- "--model", type=str, default="meta-llama/Meta-Llama-3.1-8B-Instruct"
- )
+ parser.add_argument("--model", type=str, default="Qwen/Qwen3-8B")
return parser
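Reviewer note: the tensor-parallel recommendation in this file divides the model weight size by the usable per-GPU memory and rounds the result up to a power of two. A minimal standalone sketch of that arithmetic follows. The function name, parameter names, and the example sizes are hypothetical, and the units are assumed to be GB; only the 0.9 factor and the power-of-two rounding come from the code shown in the hunk above.

```python
import math


def recommend_tensor_parallel(
    total_model_weights_gb: float,
    per_gpu_memory_gb: float,
    intermediate_buffer_gb: float,
    minimum_kv_cache_buffer_gb: float,
) -> int:
    """Hypothetical helper mirroring the rounding logic in the diff."""
    # Memory left for weights after reserving 10% headroom and two buffers.
    usable_per_gpu_memory = (
        per_gpu_memory_gb * 0.9
        - intermediate_buffer_gb
        - minimum_kv_cache_buffer_gb
    )
    if usable_per_gpu_memory <= 0:
        raise ValueError("buffers leave no usable GPU memory for weights")

    # Smallest GPU count whose combined usable memory holds the weights.
    # The max() guard is an addition here so that log2 never receives 0.
    initial_tp = max(1, math.ceil(total_model_weights_gb / usable_per_gpu_memory))

    # Round up to a power of 2.
    return 2 ** math.ceil(math.log2(initial_tp))


if __name__ == "__main__":
    # Illustrative numbers only: 140 GB of weights on 80 GB GPUs.
    print(recommend_tensor_parallel(140, 80, 5, 5))
```

With these illustrative inputs, three GPUs would hold the weights, and the rounding step raises the recommendation to four.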
diff --git a/benchmarks/multi_round_qa/multi-round-qa.py b/benchmarks/multi_round_qa/multi-round-qa.py
index bda9ea14009..cd02618cab8 100644
--- a/benchmarks/multi_round_qa/multi-round-qa.py
+++ b/benchmarks/multi_round_qa/multi-round-qa.py
@@ -42,6 +42,9 @@ class WorkloadConfig:
# Whether to include user id in request header
enable_user_id: bool
+ # Whether to strictly cap active sessions at num_users
+ enforce_strict_concurrent_users: bool = False
+
@dataclass
class UserConfig:
@@ -374,6 +377,10 @@ def __init__(
if self.use_sharegpt:
self._load_sharegpt_data()
+ self.enforce_strict_concurrent_users = (
+ workload_config.enforce_strict_concurrent_users
+ )
+
def _load_sharegpt_data(self):
with open("ShareGPT.json", "r", encoding="utf-8") as file:
self.sharegpt_data = json.load(file)
@@ -419,6 +426,19 @@ def _remove_finished_sessions(self):
self.session_summaries.append(session.summary())
self.sessions = [s for s in self.sessions if not s.finished]
+ def _can_join_user(self, timestamp: float) -> bool:
+ # Do not admit a new user session until gap_between_users has elapsed since the last join
+ if timestamp - self.last_user_join <= self.gap_between_users:
+ return False
+
+ # In strict mode, do not admit a new user session once the active session count reaches num_users
+ if (
+ self.enforce_strict_concurrent_users
+ and len(self.sessions) >= self.workload_config.num_users
+ ):
+ return False
+ return True
+
def step(self, timestamp: float, executor: RequestExecutor):
if self.need_ramp_up:
self._ramp_up(timestamp, self.ramp_up_time)
@@ -426,7 +446,8 @@ def step(self, timestamp: float, executor: RequestExecutor):
if self.start_time is None:
self.start_time = timestamp
- if timestamp - self.last_user_join > self.gap_between_users:
+ # Check whether a new user session can join
+ if self._can_join_user(timestamp):
self._create_user_session()
self.last_user_join = timestamp
logger.info(
@@ -635,6 +656,11 @@ def parse_arguments():
action="store_true",
help="Does not send requests to the endpoint (server)",
)
+ parser.add_argument(
+ "--enforce-strict-concurrent-users",
+ action="store_true",
+ help="Strictly cap the number of concurrent user sessions at --num-users",
+ )
args = parser.parse_args()
return args
@@ -688,6 +714,7 @@ def main():
qps=args.qps,
model=args.model,
enable_user_id=args.request_with_user_id,
+ enforce_strict_concurrent_users=args.enforce_strict_concurrent_users,
)
manager = UserSessionManager(
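Reviewer note: the new admission check in this file combines two conditions, a minimum gap between joins and, in strict mode, a cap on active sessions. The sketch below isolates that logic so it can be exercised without the benchmark harness. `SessionGate` and its `active_sessions` list are hypothetical stand-ins for `UserSessionManager` and `self.sessions`; the field names otherwise follow the diff.

```python
from dataclasses import dataclass, field


@dataclass
class SessionGate:
    """Hypothetical stand-in for the admission check in UserSessionManager."""

    num_users: int
    gap_between_users: float
    enforce_strict_concurrent_users: bool = False
    last_user_join: float = 0.0
    active_sessions: list = field(default_factory=list)

    def can_join_user(self, timestamp: float) -> bool:
        # Wait until the configured gap since the last join has elapsed.
        if timestamp - self.last_user_join <= self.gap_between_users:
            return False
        # In strict mode, refuse new sessions while the cap is reached.
        if (
            self.enforce_strict_concurrent_users
            and len(self.active_sessions) >= self.num_users
        ):
            return False
        return True

    def step(self, timestamp: float) -> bool:
        """Admit one session if allowed; return whether a session joined."""
        if not self.can_join_user(timestamp):
            return False
        self.active_sessions.append(timestamp)
        self.last_user_join = timestamp
        return True


if __name__ == "__main__":
    gate = SessionGate(
        num_users=2, gap_between_users=1.0, enforce_strict_concurrent_users=True
    )
    for t in (1.5, 2.0, 3.0, 5.0):
        print(t, gate.step(t))
```

Without the strict flag, only the gap check applies, so the number of active sessions can exceed `num_users` whenever sessions finish more slowly than new ones join.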
diff --git a/csrc/mem_kernels.cu b/csrc/mem_kernels.cu
index b88fa1379f8..7f89616d35b 100644
--- a/csrc/mem_kernels.cu
+++ b/csrc/mem_kernels.cu
@@ -171,6 +171,56 @@ key_value_offset(const int k_or_v, const int layer_idx, const int token_idx,
token_idx * scalars_per_token + scalar_offset;
}
+template