Conversation
Summary of Changes

Hello @xiudongxu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request aims to address potential memory leak issues by introducing a comprehensive caching strategy.
Code Review
This PR aims to solve potential memory leak issues by adding caching for tiktoken BPE file loading and model encoding processes. However, the implementation introduces a potential Denial of Service (DoS) vulnerability due to unbounded caches, particularly in aworld/models/utils.py, which can cache invalid model names leading to memory exhaustion if user-controlled input is used. Additionally, there are thread safety concerns with the new global caches (_BPE_CACHE and _TIKTOKEN_ENCODING_CACHE) and code duplication in the _load_tiktoken_bpe function across openai_tokenizer.py and qwen_tokenizer.py.
In aworld/models/openai_tokenizer.py:

+ _BPE_CACHE = {}

  def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
-     """Load tiktoken BPE file similar to qwen_tokenizer."""
+     """Load tiktoken BPE file with caching to prevent memory leaks."""
+     # Check cache first
+     if tiktoken_bpe_file in _BPE_CACHE:
+         return _BPE_CACHE[tiktoken_bpe_file]
+
      # Load and decode file
      with open(tiktoken_bpe_file, 'rb') as f:
          contents = f.read()
-     return {
-         base64.b64decode(token): int(rank) for token, rank in (line.split() for line in contents.splitlines() if line)
-     }
+     result = {
+         base64.b64decode(token): int(rank)
+         for token, rank in (line.split() for line in contents.splitlines() if line)
+     }
+
+     # Cache the result
+     _BPE_CACHE[tiktoken_bpe_file] = result
+     return result
The current cache implementation is not thread-safe. In a multi-threaded environment, race conditions can occur in which several threads try to load and cache the same BPE file at the same time; this is not only wasteful but can also leave the cache in an inconsistent state. Consider using a threading.Lock to guard concurrent access to the cache.

In addition, the _load_tiktoken_bpe function and its _BPE_CACHE are identical to the code in aworld/models/qwen_tokenizer.py. To improve maintainability, consider extracting this duplicated code into a shared utility module (for example, aworld/models/tokenizer_utils.py); this removes the duplication and also gives you a single, unified BPE cache.

Below is an example of a thread-safe implementation (using double-checked locking for efficiency):
import threading

_BPE_CACHE = {}
_BPE_CACHE_LOCK = threading.Lock()

def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
    # Fast path: check without taking the lock
    if tiktoken_bpe_file in _BPE_CACHE:
        return _BPE_CACHE[tiktoken_bpe_file]

    # Take the lock
    with _BPE_CACHE_LOCK:
        # Check again in case another thread has already populated the cache
        if tiktoken_bpe_file in _BPE_CACHE:
            return _BPE_CACHE[tiktoken_bpe_file]

        # Load the file and populate the cache
        with open(tiktoken_bpe_file, 'rb') as f:
            contents = f.read()
        result = {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in contents.splitlines() if line)
        }
        _BPE_CACHE[tiktoken_bpe_file] = result
        return result
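If the duplicated code is moved into the suggested shared module, both tokenizers would only keep an import. A minimal sketch of what that could look like (the module path aworld/models/tokenizer_utils.py comes from the suggestion above; the public name load_tiktoken_bpe is illustrative):

```python
# aworld/models/tokenizer_utils.py  (hypothetical shared module)
from typing import Dict

def load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]:
    ...  # body identical to the thread-safe, double-checked-locking example above

# aworld/models/openai_tokenizer.py and aworld/models/qwen_tokenizer.py then drop
# their local copies and share one cache via:
# from aworld.models.tokenizer_utils import load_tiktoken_bpe
```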
| """ | ||
| Get cached tiktoken encoding to prevent memory leaks. | ||
|
|
||
| Args: | ||
| model: Model name (e.g., 'gpt-4o', 'claude-3-opus') | ||
|
|
||
| Returns: | ||
| Cached tiktoken encoding object | ||
| """ | ||
| if model not in _TIKTOKEN_ENCODING_CACHE: | ||
| import tiktoken | ||
| try: | ||
| _TIKTOKEN_ENCODING_CACHE[model] = tiktoken.encoding_for_model(model) | ||
| logger.debug(f"Created and cached tiktoken encoding for model: {model}") | ||
| except KeyError: | ||
| logger.debug(f"{model} model not found. Using cl100k_base encoding.") | ||
| # Cache cl100k_base if not already cached | ||
| if "cl100k_base" not in _TIKTOKEN_ENCODING_CACHE: | ||
| _TIKTOKEN_ENCODING_CACHE["cl100k_base"] = tiktoken.get_encoding("cl100k_base") | ||
| # Reuse cl100k_base for this model | ||
| _TIKTOKEN_ENCODING_CACHE[model] = _TIKTOKEN_ENCODING_CACHE["cl100k_base"] | ||
| return _TIKTOKEN_ENCODING_CACHE[model] |
The _get_cached_tiktoken_encoding function introduces a potential Denial of Service (DoS) vulnerability due to an unbounded cache. The global dictionary _TIKTOKEN_ENCODING_CACHE can store tiktoken encoding objects for every unique model string, including invalid ones, leading to memory exhaustion if the model parameter is user-controlled. Consider using a fixed-size cache (e.g., functools.lru_cache) or validating model names against an allow-list. Furthermore, this cache is not thread-safe, which can lead to race conditions in a multi-threaded environment. It is recommended to use threading.Lock to ensure thread safety for cache operations.
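As a concrete illustration of the lru_cache suggestion, a bounded variant could look like the sketch below (the maxsize value is an arbitrary choice and the fallback behavior mirrors the current code; this is a sketch, not the PR's implementation):

```python
import functools
import tiktoken

@functools.lru_cache(maxsize=128)  # bounded: least recently used entries are evicted
def _get_cached_tiktoken_encoding(model: str):
    """Return the tiktoken encoding for `model`, falling back to cl100k_base."""
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown or user-controlled model names still occupy a cache slot,
        # but the slot count can never exceed maxsize, so memory stays bounded.
        return tiktoken.get_encoding("cl100k_base")
```

lru_cache also serializes its internal bookkeeping, which covers most of the thread-safety concern; at worst two threads may build the same encoding concurrently, which is harmless here.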
⏺ Perfect! Now I've seen all the relevant code. Let me walk you through why these caches make sense.

📊 Detailed analysis of the tiktoken caching mechanism

1️⃣ _BPE_CACHE (in openai_tokenizer.py and qwen_tokenizer.py)

Code location: aworld/models/openai_tokenizer.py and aworld/models/qwen_tokenizer.py
Purpose: cache the parsed BPE vocabulary file so it is not loaded and decoded repeatedly:

    def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]: ...

✅ Rationale: very reasonable and necessary.

Why is the cache required?
❌ What happens without the cache?

Scenario: the vocabulary is reloaded every time a tokenizer instance is created. Suppose tokenizers are constructed in several places in the system:

    tokenizer1 = OpenAITokenizer(vocab_file)  # loads the vocabulary (500ms, +1.68MB)

Consequences: every additional tokenizer instance pays the same ~500ms parse cost and keeps another ~1.68MB copy of the vocabulary in memory.
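A quick way to see the effect (a sketch; _load_tiktoken_bpe is the cached function from this PR and the vocabulary path is illustrative):

```python
import time

def time_repeated_loads(vocab_file: str, n: int = 10) -> None:
    """With the cache, only the first call pays the parse cost; the rest are dict lookups."""
    for i in range(n):
        start = time.perf_counter()
        _load_tiktoken_bpe(vocab_file)
        print(f"load {i}: {time.perf_counter() - start:.4f}s")

# time_repeated_loads("qwen.tiktoken")  # illustrative path
```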
Test data:

    Test 2: after 4 tasks

✅ Conclusion: the cache works as intended and the vocabulary is not loaded repeatedly.

2️⃣ _TIKTOKEN_ENCODING_CACHE (in utils.py)

Code location: aworld/models/utils.py
Purpose: cache tiktoken.Encoding objects so they are not constructed over and over:

    def _get_cached_tiktoken_encoding(model: str): ...

Usage scenarios: it is called for token counting on every API request (see below).
✅ Rationale: very reasonable and critical.

Why is the cache required?
Inside tiktoken, the relevant entry point is:

    def encoding_for_model(model_name: str) -> Encoding: ...

and it gets invoked on every API call, once per message:

    for message in messages: ...

❌ What happens without the cache?

Scenario: a high-concurrency API service handling 100 concurrent requests:

    for i in range(100): ...

Consequences: every request looks up or rebuilds the encoding instead of reusing one shared object.
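To make the request path concrete, here is a sketch of per-request token counting on top of the cached helper (the counting function and thread-pool driver are illustrative, not existing code):

```python
from concurrent.futures import ThreadPoolExecutor

def count_tokens(model: str, messages: list) -> int:
    """Count tokens for one request; the Encoding object comes from the cache."""
    encoding = _get_cached_tiktoken_encoding(model)  # cheap after the first call per model
    return sum(len(encoding.encode(message)) for message in messages)

# 100 concurrent requests all reuse the same cached Encoding object
with ThreadPoolExecutor(max_workers=16) as pool:
    counts = list(pool.map(lambda _: count_tokens("gpt-4o", ["hello world"]), range(100)))
```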
Real-world case (from your analysis report): your test_memory_leak.py tests exactly this problem:

    # test whether tiktoken encoding causes a memory leak
    for i in range(1000): ...

Without the cache, memory would grow linearly!

3️⃣ Why a global dict instead of @lru_cache?

Comparing the two approaches:

Option A: use @lru_cache

    @lru_cache(maxsize=128)
    ...

Option B: use a global dict (the current approach)

    def _load_tiktoken_bpe(tiktoken_bpe_file: str) -> Dict[bytes, int]: ...

✅ Why the global dict is the better fit here: the whole application only ever uses two or three vocabulary files:

    _BPE_CACHE = { ... }

With lru_cache(maxsize=1), entries could be evicted and reloaded over and over!

📈 Performance comparison: with cache vs. without cache

Test scenario: 1000 token-counting calls
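A minimal sketch of such a benchmark (the model name and text are illustrative; the cached path uses the helper added in this PR, the direct path calls tiktoken each time):

```python
import time
import tiktoken

def bench(n: int = 1000, model: str = "gpt-4o", text: str = "hello world") -> None:
    """Compare n token-counting calls with and without the PR's encoding cache."""
    # Path 1: look the encoding up through tiktoken on every call
    start = time.perf_counter()
    for _ in range(n):
        tiktoken.encoding_for_model(model).encode(text)
    direct = time.perf_counter() - start

    # Path 2: reuse the object stored in _TIKTOKEN_ENCODING_CACHE via the new helper
    start = time.perf_counter()
    for _ in range(n):
        _get_cached_tiktoken_encoding(model).encode(text)
    cached = time.perf_counter() - start

    print(f"direct lookup: {direct:.3f}s  cached helper: {cached:.3f}s")
```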