-
Notifications
You must be signed in to change notification settings - Fork 1.1k
PYTHON-5679 Optimize ObjectId
#2656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
+18
−33
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
From profiles of a mongoengine-using project, I’ve noticed that `ObjectId` creation happens a lot, around ~0.5% of total request time. This PR optimizes the class in two ways: 1. It inlines the `__generate()` and `__validate()` methods, removing their call overhead. The methods are small and not called anywhere else, and Python imposes a constant overhead for method calls, so inlining them saves time. 2. Checking for `str` values before `ObjectId` ones. I believe that typical usage patterns involve creating `ObjectId`s from strings more often than from existing `ObjectId` instances, so checking for `str` first should reduce the average number of checks needed. I validated these changes with the below benchmark script, run with [richbench](https://github.com/tonybaloney/rich-bench). <details> <summary><code>bench_objectid.py</code></summary> ```python from __future__ import annotations import os import struct import threading import time from random import SystemRandom from typing import Any, NoReturn, Optional, Union _MAX_COUNTER_VALUE = 0xFFFFFF _PACK_INT = struct.Struct(">I").pack _PACK_INT_RANDOM = struct.Struct(">I5s").pack def _raise_invalid_id(oid: str) -> NoReturn: raise ValueError(f"{oid!r} is not a valid ObjectId") def _random_bytes() -> bytes: return os.urandom(5) class ObjectIdBefore: _pid = os.getpid() _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE) _inc_lock = threading.Lock() __random = _random_bytes() __slots__ = ("__id",) def __init__(self, oid: Optional[Union[str, ObjectIdBefore, bytes]] = None) -> None: if oid is None: self.__generate() elif isinstance(oid, bytes) and len(oid) == 12: self.__id = oid else: self.__validate(oid) @classmethod def _random(cls) -> bytes: pid = os.getpid() if pid != cls._pid: cls._pid = pid cls.__random = _random_bytes() return cls.__random def __generate(self) -> None: with ObjectIdBefore._inc_lock: inc = ObjectIdBefore._inc ObjectIdBefore._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1) self.__id = ( _PACK_INT_RANDOM(int(time.time()), ObjectIdBefore._random()) + _PACK_INT(inc)[1:4] ) def __validate(self, oid: Any) -> None: if isinstance(oid, ObjectIdBefore): self.__id = oid.binary elif isinstance(oid, str): if len(oid) == 24: try: self.__id = bytes.fromhex(oid) except (TypeError, ValueError): _raise_invalid_id(oid) else: _raise_invalid_id(oid) else: raise TypeError(f"id must be an instance of (bytes, str, ObjectId), not {type(oid)}") @Property def binary(self) -> bytes: return self.__id class ObjectIdAfter: _pid = os.getpid() _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE) _inc_lock = threading.Lock() __random = _random_bytes() __slots__ = ("__id",) def __init__(self, oid: Optional[Union[str, ObjectIdAfter, bytes]] = None) -> None: if oid is None: with ObjectIdAfter._inc_lock: inc = ObjectIdAfter._inc ObjectIdAfter._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1) self.__id = ( _PACK_INT_RANDOM(int(time.time()), ObjectIdAfter._random()) + _PACK_INT(inc)[1:4] ) elif isinstance(oid, bytes) and len(oid) == 12: self.__id = oid elif isinstance(oid, str): if len(oid) == 24: try: self.__id = bytes.fromhex(oid) except (TypeError, ValueError): _raise_invalid_id(oid) else: _raise_invalid_id(oid) elif isinstance(oid, ObjectIdAfter): self.__id = oid.binary else: raise TypeError(f"id must be an instance of (bytes, str, ObjectId), not {type(oid)}") @classmethod def _random(cls) -> bytes: pid = os.getpid() if pid != cls._pid: cls._pid = pid cls.__random = _random_bytes() return cls.__random @Property def binary(self) -> bytes: return self.__id test_oid_str = "507f1f77bcf86cd799439011" test_oid_bytes = bytes.fromhex(test_oid_str) test_oid_obj_before = ObjectIdBefore(test_oid_bytes) test_oid_obj_after = ObjectIdAfter(test_oid_bytes) invalid_str_short = "507f1f77" invalid_str_long = "507f1f77bcf86cd799439011aabbccdd" invalid_str_non_hex = "507f1f77bcf86cd799439ZZZ" invalid_bytes_short = b"short" invalid_bytes_long = b"toolongbytes1234" invalid_type = 12345 def bench_none_before(): for _ in range(1_000_000): ObjectIdBefore() def bench_none_after(): for _ in range(1_000_000): ObjectIdAfter() def bench_str_before(): for _ in range(1_000_000): ObjectIdBefore(test_oid_str) def bench_str_after(): for _ in range(1_000_000): ObjectIdAfter(test_oid_str) def bench_objectid_before(): for _ in range(1_000_000): ObjectIdBefore(test_oid_obj_before) def bench_objectid_after(): for _ in range(1_000_000): ObjectIdAfter(test_oid_obj_after) def bench_invalid_str_short_before(): for _ in range(1_000_000): try: ObjectIdBefore(invalid_str_short) except Exception: pass def bench_invalid_str_short_after(): for _ in range(10_000): try: ObjectIdAfter(invalid_str_short) except Exception: pass def bench_invalid_type_before(): for _ in range(10_000): try: ObjectIdBefore(invalid_type) except Exception: pass def bench_invalid_type_after(): for _ in range(10_000): try: ObjectIdAfter(invalid_type) except Exception: pass __benchmarks__ = [ (bench_none_before, bench_none_after, "ObjectId() - generate new"), (bench_str_before, bench_str_after, "ObjectId(str) - 24-char hex"), (bench_objectid_before, bench_objectid_after, "ObjectId(ObjectId) - copy"), ( bench_invalid_str_short_before, bench_invalid_str_short_after, "ObjectId(str) - invalid short", ), (bench_invalid_type_before, bench_invalid_type_after, "ObjectId(int) - invalid type"), ] ``` </details> Results: ``` $ uvx -p 3.13 richbench . Benchmarks, repeat=5, number=5 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Benchmark ┃ Min ┃ Max ┃ Mean ┃ Min (+) ┃ Max (+) ┃ Mean (+) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ ObjectId() - generate new │ 1.920 │ 2.021 │ 1.990 │ 1.865 (1.0x) │ 1.868 (1.1x) │ 1.866 (1.1x) │ │ ObjectId(str) - 24-char hex │ 0.938 │ 0.952 │ 0.946 │ 0.716 (1.3x) │ 0.725 (1.3x) │ 0.720 (1.3x) │ │ ObjectId(ObjectId) - copy │ 0.845 │ 0.857 │ 0.850 │ 0.818 (1.0x) │ 0.828 (1.0x) │ 0.822 (1.0x) │ │ ObjectId(str) - invalid short │ 1.839 │ 1.857 │ 1.845 │ 0.014 (127.9x) │ 0.015 (125.7x) │ 0.015 (126.4x) │ │ ObjectId(int) - invalid type │ 0.020 │ 0.021 │ 0.021 │ 0.016 (1.2x) │ 0.017 (1.3x) │ 0.017 (1.2x) │ └───────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘ ``` The speedups for normal usage are like 1.1x to 1.3x, which is pretty decent for a small change. The invalid string case is much faster, but that’s less relevant since it’s an error path.
adamchainz
added a commit
to adamchainz/mongo-python-driver
that referenced
this pull request
Dec 18, 2025
Like mongodb#2656, I spotted this function taking ~0.9% of request time in a project using mongoengine. [`bytes.hex()`](https://docs.python.org/3/library/stdtypes.html#bytes.hex) was added in Python 3.5 and does the same as `binascii.hexlify(...).decode()` but faster, so I've replaced it here. (The `binascii.hexlify()` docs also now point to `bytes.hex()` as a faster alternative.) Benchmarked again using [`richbench`](https://github.com/tonybaloney/rich-bench) with the below script. <details> <summary><code>bench_objectid_str.py</code></summary> ```python from __future__ import annotations import binascii import os class ObjectIdBefore: __slots__ = ("__id",) def __init__(self) -> None: self.__id = os.urandom(12) def __str__(self) -> str: return binascii.hexlify(self.__id).decode() class ObjectIdAfter: __slots__ = ("__id",) def __init__(self) -> None: self.__id = os.urandom(12) def __str__(self) -> str: return self.__id.hex() test_oids_before = [ObjectIdBefore() for _ in range(100)] test_oids_after = [ObjectIdAfter() for _ in range(100)] def bench_str_before(): for oid in test_oids_before: for _ in range(100_000): str(oid) def bench_str_after(): for oid in test_oids_after: for _ in range(100_000): str(oid) __benchmarks__ = [ (bench_str_before, bench_str_after, "str(ObjectId) - hex conversion"), ] ``` </details> Results show a ~1.4x speedup: ``` $ uvx -p 3.13 richbench . Benchmarks, repeat=5, number=5 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Benchmark ┃ Min ┃ Max ┃ Mean ┃ Min (+) ┃ Max (+) ┃ Mean (+) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ str(ObjectId) - hex conversion │ 3.000 │ 3.061 │ 3.019 │ 2.169 (1.4x) │ 2.231 (1.4x) │ 2.196 (1.4x) │ └────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘ ```
6 tasks
Member
|
Thanks for these! The code changes on both look good. I pulled from |
blink1073
approved these changes
Dec 18, 2025
Member
blink1073
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Changes in this PR
From profiles of a mongoengine-using project, I’ve noticed that
ObjectIdcreation happens a lot, around ~0.5% of total request time. This PR optimizes the class in two ways:It inlines the
__generate()and__validate()methods, removing their call overhead.The methods are small and not called anywhere else, and Python imposes a constant overhead for method calls, so inlining them saves time.
Checking for
strvalues beforeObjectIdones.I believe that typical usage patterns involve creating
ObjectIds from strings more often than from existingObjectIdinstances, so checking forstrfirst should reduce the average number of checks needed.Test Plan
I validated these changes with the below benchmark script, run with richbench.
bench_objectid.pyResults:
The speedups for normal usage are like 1.1x to 1.3x, which is pretty decent for a small change. The invalid string case is much faster, but that’s less relevant since it’s an error path.
Checklist
Checklist for Author
Checklist for Reviewer