PYTHON-5679 Optimize `ObjectId` #2656

adamchainz · 2025-12-18T00:05:24Z

Changes in this PR

From profiles of a mongoengine-using project, I’ve noticed that ObjectId creation happens a lot, around ~0.5% of total request time. This PR optimizes the class in two ways:

It inlines the __generate() and __validate() methods, removing their call overhead.

The methods are small and not called anywhere else, and Python imposes a constant overhead for method calls, so inlining them saves time.
Checking for str values before ObjectId ones.

I believe that typical usage patterns involve creating ObjectIds from strings more often than from existing ObjectId instances, so checking for str first should reduce the average number of checks needed.

Test Plan

I validated these changes with the below benchmark script, run with richbench.

bench_objectid.py

from __future__ import annotations

import os
import struct
import threading
import time
from random import SystemRandom
from typing import Any, NoReturn, Optional, Union

_MAX_COUNTER_VALUE = 0xFFFFFF
_PACK_INT = struct.Struct(">I").pack
_PACK_INT_RANDOM = struct.Struct(">I5s").pack

def _raise_invalid_id(oid: str) -> NoReturn:
    raise ValueError(f"{oid!r} is not a valid ObjectId")

def _random_bytes() -> bytes:
    return os.urandom(5)

class ObjectIdBefore:
    _pid = os.getpid()
    _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE)
    _inc_lock = threading.Lock()
    __random = _random_bytes()
    __slots__ = ("__id",)

    def __init__(self, oid: Optional[Union[str, ObjectIdBefore, bytes]] = None) -> None:
        if oid is None:
            self.__generate()
        elif isinstance(oid, bytes) and len(oid) == 12:
            self.__id = oid
        else:
            self.__validate(oid)

    @classmethod
    def _random(cls) -> bytes:
        pid = os.getpid()
        if pid != cls._pid:
            cls._pid = pid
            cls.__random = _random_bytes()
        return cls.__random

    def __generate(self) -> None:
        with ObjectIdBefore._inc_lock:
            inc = ObjectIdBefore._inc
            ObjectIdBefore._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1)
        self.__id = (
            _PACK_INT_RANDOM(int(time.time()), ObjectIdBefore._random()) + _PACK_INT(inc)[1:4]
        )

    def __validate(self, oid: Any) -> None:
        if isinstance(oid, ObjectIdBefore):
            self.__id = oid.binary
        elif isinstance(oid, str):
            if len(oid) == 24:
                try:
                    self.__id = bytes.fromhex(oid)
                except (TypeError, ValueError):
                    _raise_invalid_id(oid)
            else:
                _raise_invalid_id(oid)
        else:
            raise TypeError(f"id must be an instance of (bytes, str, ObjectId), not {type(oid)}")

    @property
    def binary(self) -> bytes:
        return self.__id

class ObjectIdAfter:
    _pid = os.getpid()
    _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE)
    _inc_lock = threading.Lock()
    __random = _random_bytes()
    __slots__ = ("__id",)

    def __init__(self, oid: Optional[Union[str, ObjectIdAfter, bytes]] = None) -> None:
        if oid is None:
            with ObjectIdAfter._inc_lock:
                inc = ObjectIdAfter._inc
                ObjectIdAfter._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1)
            self.__id = (
                _PACK_INT_RANDOM(int(time.time()), ObjectIdAfter._random()) + _PACK_INT(inc)[1:4]
            )
        elif isinstance(oid, bytes) and len(oid) == 12:
            self.__id = oid
        elif isinstance(oid, str):
            if len(oid) == 24:
                try:
                    self.__id = bytes.fromhex(oid)
                except (TypeError, ValueError):
                    _raise_invalid_id(oid)
            else:
                _raise_invalid_id(oid)
        elif isinstance(oid, ObjectIdAfter):
            self.__id = oid.binary
        else:
            raise TypeError(f"id must be an instance of (bytes, str, ObjectId), not {type(oid)}")

    @classmethod
    def _random(cls) -> bytes:
        pid = os.getpid()
        if pid != cls._pid:
            cls._pid = pid
            cls.__random = _random_bytes()
        return cls.__random

    @property
    def binary(self) -> bytes:
        return self.__id

test_oid_str = "507f1f77bcf86cd799439011"
test_oid_bytes = bytes.fromhex(test_oid_str)
test_oid_obj_before = ObjectIdBefore(test_oid_bytes)
test_oid_obj_after = ObjectIdAfter(test_oid_bytes)
invalid_str_short = "507f1f77"
invalid_str_long = "507f1f77bcf86cd799439011aabbccdd"
invalid_str_non_hex = "507f1f77bcf86cd799439ZZZ"
invalid_bytes_short = b"short"
invalid_bytes_long = b"toolongbytes1234"
invalid_type = 12345

def bench_none_before():
    for _ in range(1_000_000):
        ObjectIdBefore()

def bench_none_after():
    for _ in range(1_000_000):
        ObjectIdAfter()

def bench_str_before():
    for _ in range(1_000_000):
        ObjectIdBefore(test_oid_str)

def bench_str_after():
    for _ in range(1_000_000):
        ObjectIdAfter(test_oid_str)

def bench_objectid_before():
    for _ in range(1_000_000):
        ObjectIdBefore(test_oid_obj_before)

def bench_objectid_after():
    for _ in range(1_000_000):
        ObjectIdAfter(test_oid_obj_after)

def bench_invalid_str_short_before():
    for _ in range(1_000_000):
        try:
            ObjectIdBefore(invalid_str_short)
        except Exception:
            pass

def bench_invalid_str_short_after():
    for _ in range(10_000):
        try:
            ObjectIdAfter(invalid_str_short)
        except Exception:
            pass

def bench_invalid_type_before():
    for _ in range(10_000):
        try:
            ObjectIdBefore(invalid_type)
        except Exception:
            pass

def bench_invalid_type_after():
    for _ in range(10_000):
        try:
            ObjectIdAfter(invalid_type)
        except Exception:
            pass

__benchmarks__ = [
    (bench_none_before, bench_none_after, "ObjectId() - generate new"),
    (bench_str_before, bench_str_after, "ObjectId(str) - 24-char hex"),
    (bench_objectid_before, bench_objectid_after, "ObjectId(ObjectId) - copy"),
    (
        bench_invalid_str_short_before,
        bench_invalid_str_short_after,
        "ObjectId(str) - invalid short",
    ),
    (bench_invalid_type_before, bench_invalid_type_after, "ObjectId(int) - invalid type"),
]

Results:

$ uvx -p 3.13 richbench .
                                           Benchmarks, repeat=5, number=5
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃                     Benchmark ┃ Min     ┃ Max     ┃ Mean    ┃ Min (+)         ┃ Max (+)         ┃ Mean (+)        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│     ObjectId() - generate new │ 1.920   │ 2.021   │ 1.990   │ 1.865 (1.0x)    │ 1.868 (1.1x)    │ 1.866 (1.1x)    │
│   ObjectId(str) - 24-char hex │ 0.938   │ 0.952   │ 0.946   │ 0.716 (1.3x)    │ 0.725 (1.3x)    │ 0.720 (1.3x)    │
│     ObjectId(ObjectId) - copy │ 0.845   │ 0.857   │ 0.850   │ 0.818 (1.0x)    │ 0.828 (1.0x)    │ 0.822 (1.0x)    │
│ ObjectId(str) - invalid short │ 1.839   │ 1.857   │ 1.845   │ 0.014 (127.9x)  │ 0.015 (125.7x)  │ 0.015 (126.4x)  │
│  ObjectId(int) - invalid type │ 0.020   │ 0.021   │ 0.021   │ 0.016 (1.2x)    │ 0.017 (1.3x)    │ 0.017 (1.2x)    │
└───────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘

The speedups for normal usage are like 1.1x to 1.3x, which is pretty decent for a small change. The invalid string case is much faster, but that’s less relevant since it’s an error path.

Checklist

Checklist for Author

Did you update the changelog (if necessary)?
Is there test coverage?
Is any followup work tracked in a JIRA ticket? If so, add link(s).

Checklist for Reviewer

Does the title of the PR reference a JIRA Ticket?
Do you fully understand the implementation? (Would you be comfortable explaining how this code works to someone else?)
Is all relevant documentation (README or docstring) updated?

From profiles of a mongoengine-using project, I’ve noticed that `ObjectId` creation happens a lot, around ~0.5% of total request time. This PR optimizes the class in two ways: 1. It inlines the `__generate()` and `__validate()` methods, removing their call overhead. The methods are small and not called anywhere else, and Python imposes a constant overhead for method calls, so inlining them saves time. 2. Checking for `str` values before `ObjectId` ones. I believe that typical usage patterns involve creating `ObjectId`s from strings more often than from existing `ObjectId` instances, so checking for `str` first should reduce the average number of checks needed. I validated these changes with the below benchmark script, run with [richbench](https://github.com/tonybaloney/rich-bench). <details> <summary><code>bench_objectid.py</code></summary> ```python from __future__ import annotations import os import struct import threading import time from random import SystemRandom from typing import Any, NoReturn, Optional, Union _MAX_COUNTER_VALUE = 0xFFFFFF _PACK_INT = struct.Struct(">I").pack _PACK_INT_RANDOM = struct.Struct(">I5s").pack def _raise_invalid_id(oid: str) -> NoReturn: raise ValueError(f"{oid!r} is not a valid ObjectId") def _random_bytes() -> bytes: return os.urandom(5) class ObjectIdBefore: _pid = os.getpid() _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE) _inc_lock = threading.Lock() __random = _random_bytes() __slots__ = ("__id",) def __init__(self, oid: Optional[Union[str, ObjectIdBefore, bytes]] = None) -> None: if oid is None: self.__generate() elif isinstance(oid, bytes) and len(oid) == 12: self.__id = oid else: self.__validate(oid) @classmethod def _random(cls) -> bytes: pid = os.getpid() if pid != cls._pid: cls._pid = pid cls.__random = _random_bytes() return cls.__random def __generate(self) -> None: with ObjectIdBefore._inc_lock: inc = ObjectIdBefore._inc ObjectIdBefore._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1) self.__id = ( _PACK_INT_RANDOM(int(time.time()), ObjectIdBefore._random()) + _PACK_INT(inc)[1:4] ) def __validate(self, oid: Any) -> None: if isinstance(oid, ObjectIdBefore): self.__id = oid.binary elif isinstance(oid, str): if len(oid) == 24: try: self.__id = bytes.fromhex(oid) except (TypeError, ValueError): _raise_invalid_id(oid) else: _raise_invalid_id(oid) else: raise TypeError(f"id must be an instance of (bytes, str, ObjectId), not {type(oid)}") @Property def binary(self) -> bytes: return self.__id class ObjectIdAfter: _pid = os.getpid() _inc = SystemRandom().randint(0, _MAX_COUNTER_VALUE) _inc_lock = threading.Lock() __random = _random_bytes() __slots__ = ("__id",) def __init__(self, oid: Optional[Union[str, ObjectIdAfter, bytes]] = None) -> None: if oid is None: with ObjectIdAfter._inc_lock: inc = ObjectIdAfter._inc ObjectIdAfter._inc = (inc + 1) % (_MAX_COUNTER_VALUE + 1) self.__id = ( _PACK_INT_RANDOM(int(time.time()), ObjectIdAfter._random()) + _PACK_INT(inc)[1:4] ) elif isinstance(oid, bytes) and len(oid) == 12: self.__id = oid elif isinstance(oid, str): if len(oid) == 24: try: self.__id = bytes.fromhex(oid) except (TypeError, ValueError): _raise_invalid_id(oid) else: _raise_invalid_id(oid) elif isinstance(oid, ObjectIdAfter): self.__id = oid.binary else: raise TypeError(f"id must be an instance of (bytes, str, ObjectId), not {type(oid)}") @classmethod def _random(cls) -> bytes: pid = os.getpid() if pid != cls._pid: cls._pid = pid cls.__random = _random_bytes() return cls.__random @Property def binary(self) -> bytes: return self.__id test_oid_str = "507f1f77bcf86cd799439011" test_oid_bytes = bytes.fromhex(test_oid_str) test_oid_obj_before = ObjectIdBefore(test_oid_bytes) test_oid_obj_after = ObjectIdAfter(test_oid_bytes) invalid_str_short = "507f1f77" invalid_str_long = "507f1f77bcf86cd799439011aabbccdd" invalid_str_non_hex = "507f1f77bcf86cd799439ZZZ" invalid_bytes_short = b"short" invalid_bytes_long = b"toolongbytes1234" invalid_type = 12345 def bench_none_before(): for _ in range(1_000_000): ObjectIdBefore() def bench_none_after(): for _ in range(1_000_000): ObjectIdAfter() def bench_str_before(): for _ in range(1_000_000): ObjectIdBefore(test_oid_str) def bench_str_after(): for _ in range(1_000_000): ObjectIdAfter(test_oid_str) def bench_objectid_before(): for _ in range(1_000_000): ObjectIdBefore(test_oid_obj_before) def bench_objectid_after(): for _ in range(1_000_000): ObjectIdAfter(test_oid_obj_after) def bench_invalid_str_short_before(): for _ in range(1_000_000): try: ObjectIdBefore(invalid_str_short) except Exception: pass def bench_invalid_str_short_after(): for _ in range(10_000): try: ObjectIdAfter(invalid_str_short) except Exception: pass def bench_invalid_type_before(): for _ in range(10_000): try: ObjectIdBefore(invalid_type) except Exception: pass def bench_invalid_type_after(): for _ in range(10_000): try: ObjectIdAfter(invalid_type) except Exception: pass __benchmarks__ = [ (bench_none_before, bench_none_after, "ObjectId() - generate new"), (bench_str_before, bench_str_after, "ObjectId(str) - 24-char hex"), (bench_objectid_before, bench_objectid_after, "ObjectId(ObjectId) - copy"), ( bench_invalid_str_short_before, bench_invalid_str_short_after, "ObjectId(str) - invalid short", ), (bench_invalid_type_before, bench_invalid_type_after, "ObjectId(int) - invalid type"), ] ``` </details> Results: ``` $ uvx -p 3.13 richbench . Benchmarks, repeat=5, number=5 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Benchmark ┃ Min ┃ Max ┃ Mean ┃ Min (+) ┃ Max (+) ┃ Mean (+) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ ObjectId() - generate new │ 1.920 │ 2.021 │ 1.990 │ 1.865 (1.0x) │ 1.868 (1.1x) │ 1.866 (1.1x) │ │ ObjectId(str) - 24-char hex │ 0.938 │ 0.952 │ 0.946 │ 0.716 (1.3x) │ 0.725 (1.3x) │ 0.720 (1.3x) │ │ ObjectId(ObjectId) - copy │ 0.845 │ 0.857 │ 0.850 │ 0.818 (1.0x) │ 0.828 (1.0x) │ 0.822 (1.0x) │ │ ObjectId(str) - invalid short │ 1.839 │ 1.857 │ 1.845 │ 0.014 (127.9x) │ 0.015 (125.7x) │ 0.015 (126.4x) │ │ ObjectId(int) - invalid type │ 0.020 │ 0.021 │ 0.021 │ 0.016 (1.2x) │ 0.017 (1.3x) │ 0.017 (1.2x) │ └───────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘ ``` The speedups for normal usage are like 1.1x to 1.3x, which is pretty decent for a small change. The invalid string case is much faster, but that’s less relevant since it’s an error path.

Like mongodb#2656, I spotted this function taking ~0.9% of request time in a project using mongoengine. [`bytes.hex()`](https://docs.python.org/3/library/stdtypes.html#bytes.hex) was added in Python 3.5 and does the same as `binascii.hexlify(...).decode()` but faster, so I've replaced it here. (The `binascii.hexlify()` docs also now point to `bytes.hex()` as a faster alternative.) Benchmarked again using [`richbench`](https://github.com/tonybaloney/rich-bench) with the below script. <details> <summary><code>bench_objectid_str.py</code></summary> ```python from __future__ import annotations import binascii import os class ObjectIdBefore: __slots__ = ("__id",) def __init__(self) -> None: self.__id = os.urandom(12) def __str__(self) -> str: return binascii.hexlify(self.__id).decode() class ObjectIdAfter: __slots__ = ("__id",) def __init__(self) -> None: self.__id = os.urandom(12) def __str__(self) -> str: return self.__id.hex() test_oids_before = [ObjectIdBefore() for _ in range(100)] test_oids_after = [ObjectIdAfter() for _ in range(100)] def bench_str_before(): for oid in test_oids_before: for _ in range(100_000): str(oid) def bench_str_after(): for oid in test_oids_after: for _ in range(100_000): str(oid) __benchmarks__ = [ (bench_str_before, bench_str_after, "str(ObjectId) - hex conversion"), ] ``` </details> Results show a ~1.4x speedup: ``` $ uvx -p 3.13 richbench . Benchmarks, repeat=5, number=5 ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ ┃ Benchmark ┃ Min ┃ Max ┃ Mean ┃ Min (+) ┃ Max (+) ┃ Mean (+) ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ str(ObjectId) - hex conversion │ 3.000 │ 3.061 │ 3.019 │ 2.169 (1.4x) │ 2.231 (1.4x) │ 2.196 (1.4x) │ └────────────────────────────────┴─────────┴─────────┴─────────┴─────────────────┴─────────────────┴─────────────────┘ ```

blink1073 · 2025-12-18T02:52:51Z

Thanks for these! The code changes on both look good. I pulled from master to pick up a typing fix and to trigger the Evergreen build. I'll circle back tomorrow for final approval.

blink1073

LGTM!

adamchainz requested a review from a team as a code owner December 18, 2025 00:05

adamchainz requested a review from blink1073 December 18, 2025 00:05

adamchainz mentioned this pull request Dec 18, 2025

PYTHON-5679 Optimize ObjectId.__str__() #2657

Merged

6 tasks

Merge branch 'master' into optimize_objectid

6f1fcae

blink1073 changed the title ~~Optimize ObjectId~~ PYTHON-5679 Optimize ObjectId Dec 18, 2025

blink1073 approved these changes Dec 18, 2025

View reviewed changes

blink1073 merged commit b1ea391 into mongodb:master Dec 18, 2025
79 of 80 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PYTHON-5679 Optimize `ObjectId` #2656

PYTHON-5679 Optimize `ObjectId` #2656

adamchainz commented Dec 18, 2025

Uh oh!

blink1073 commented Dec 18, 2025

Uh oh!

blink1073 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PYTHON-5679 Optimize ObjectId #2656

PYTHON-5679 Optimize ObjectId #2656

Conversation

adamchainz commented Dec 18, 2025

Changes in this PR

Test Plan

Checklist

Checklist for Author

Checklist for Reviewer

Uh oh!

blink1073 commented Dec 18, 2025

Uh oh!

blink1073 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PYTHON-5679 Optimize `ObjectId` #2656

PYTHON-5679 Optimize `ObjectId` #2656