Skip to content

Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208

Open
JustForFun88 wants to merge 2 commits into
messense:mainfrom
JustForFun88:new_string_to_fix_ub
Open

Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208
JustForFun88 wants to merge 2 commits into
messense:mainfrom
JustForFun88:new_string_to_fix_ub

Conversation

@JustForFun88

Copy link
Copy Markdown
Contributor

Closes #207. Return owned types instead of borrowed references:

Before After
as_string() -> &str as_string() -> PdfString
as_name() -> &[u8] as_name() -> PdfVec
as_bytes() -> &[u8] as_bytes() -> PdfVec

I decided to use custom PdfString / PdfVec types instead of String / Vec<u8> because these custom structures allow the following:

  • Inline storage (up to 23 bytes) with no heap allocation, which covers 99% of standard PDF names (Type, Subtype, Font, Rect, ...) and most short string values.
  • Heap allocation uses ecow::EcoVec<u8> which is a reference-counted, copy-on-write vector. Since we do not need to mutate the returned strings, I think that cheap cloning is preferable.

I didn't use existing Rust crates, since most compact string crates (e.g. compact_str, ecow::EcoString, etc.) are designed for Rust strings and cannot provide all of the following at the same time:

  • Inline storage;
  • Cheap &CStr conversion without allocation;
  • Niche optimization.

PdfString / PdfVec are NUL-terminated, so .as_cstr() is just a pointer cast. Additionally, they support niche optimization, so Option<PdfString> / Option<PdfVec> have the same size as the structures themselves.

I also added a helper TryAsCStr trait so functions like get_dict, new_name, and new_string accept not only &str, but also &CStr or PdfString directly. This allows avoiding runtime CString allocations when the caller already has a NUL-terminated value (e.g. passing c"Subtype" instead of "Subtype").

Comment thread compact_cstr/src/repr.rs
Comment thread src/pdf/primitives/repr.rs Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces owned, NUL-terminated PdfString/PdfVec primitives (inline up to 23 bytes; heap via ecow::EcoVec) and updates PdfObject::{as_string, as_name, as_bytes} to return owned values to prevent dangling references when MuPDF resolves indirect objects.

Changes:

  • Add pdf::primitives::{PdfVec, PdfString} backed by a shared 24-byte Repr with inline + COW heap storage.
  • Update PdfObject::{as_string, as_name, as_bytes} to return owned PdfString/PdfVec, plus add TryAsCStr and broaden string-taking APIs (new_name, new_string, etc.).
  • Update link/destination parsing and add regression tests around indirect-object deletion.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/pdf/primitives/mod.rs New primitives module; exports PdfString/PdfVec and defines comparison macro + layout asserts.
src/pdf/primitives/repr.rs Implements the shared 24-byte inline/heap representation and &CStr view.
src/pdf/primitives/vector.rs Adds PdfVec owned byte container with as_slice/as_cstr and comparison impls.
src/pdf/primitives/string.rs Adds PdfString wrapper around PdfVec with UTF-8 validation and string conveniences.
src/pdf/primitives/tests.rs Unit tests for layout, inline/heap behavior, and PartialEq interoperability.
src/pdf/object.rs Switches as_* accessors to owned returns; adds TryAsCStr, extends IntoPdfDictKey, adds name_eq.
src/pdf/mod.rs Exposes primitives module and reexports PdfString/PdfVec + new traits.
src/pdf/links/extraction.rs Adapts parsing to new owned return types and new PdfVec/PdfString conversions.
src/pdf/document.rs Broadens APIs to accept TryAsCStr; adds indirect-delete regression tests.
src/pdf/annotation.rs Updates set_author to accept TryAsCStr.
src/destination.rs Adjusts name matching to use PdfVec::as_slice().
Cargo.toml Adds ecow dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/pdf/primitives/vector.rs Outdated
Comment thread src/pdf/primitives/string.rs Outdated
Comment thread src/pdf/document.rs
Comment thread src/pdf/primitives/repr.rs Outdated

@itsjunetime itsjunetime left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do appreciate the work that's gone into implementing this and ensuring it has all these very desirable properties. It's cool to see.

However, I don't think we should merge this as-is. There are too many potential vectors for unsoundness, which is not super great especially since we're trying to add this to avoid unsoundness.

Imo, structs like this are generally best off living in their own crate, where they are more easily auditable and analyzable. I think putting this sort of CStr representation in its own crate would also just be nice for the ecosystem - I doubt mupdf is the first crate to want a nicely-optimized small c string.

I also think that this would be better off using some sort of property testing (running under miri, ofc) to suss out other instances of potential unsoundness. A simple proptest that just generates nonsense strings and vecs of bytes, passes them through a few different functions that we've added here, and validates that they come out the other end looking exactly the same, would be a nice way to feel more confident about this.

Comment thread src/pdf/primitives/repr.rs Outdated
Comment thread src/pdf/primitives/repr.rs Outdated
Comment thread compact_cstr/src/repr.rs
Comment thread compact_cstr/src/repr.rs
Comment thread compact_cstr/src/repr.rs
Comment thread compact_cstr/src/repr.rs
Comment thread src/pdf/primitives/string.rs Outdated
Comment thread src/pdf/primitives/vector.rs Outdated
Comment thread src/pdf/primitives/vector.rs Outdated
Comment thread compact_cstr/src/repr.rs Outdated
@JustForFun88 JustForFun88 force-pushed the new_string_to_fix_ub branch 3 times, most recently from 92564ed to 4893331 Compare May 3, 2026 17:44
@JustForFun88

Copy link
Copy Markdown
Contributor Author

I do appreciate the work that's gone into implementing this and ensuring it has all these very desirable properties. It's cool to see.

Thanks, I really appreciate the feedback.

However, I don't think we should merge this as-is. There are too many potential vectors for unsoundness, which is not super great especially since we're trying to add this to avoid unsoundness.

I’ve addressed this by tightening the invariants and removing the possibility of constructing these types incorrectly from arbitrary Vec<u8> / strings. Unsafe parts are clearly documented

Imo, structs like this are generally best off living in their own crate, where they are more easily auditable and analyzable. I think putting this sort of CStr representation in its own crate would also just be nice for the ecosystem - I doubt mupdf is the first crate to want a nicely-optimized small c string.

Makes sense. I’ve moved the implementation into a separate crate to make it easier to test and reuse independently.

I also think that this would be better off using some sort of property testing (running under miri, ofc) to suss out other instances of potential unsoundness. A simple proptest that just generates nonsense strings and vecs of bytes, passes them through a few different functions that we've added here, and validates that they come out the other end looking exactly the same, would be a nice way to feel more confident about this.

Done. I added property-based tests using proptest, along with runs under Miri, ASan, and Valgrind. Since Miri is quite slow, the number of proptest cases is reduced to 2 when running under it.

@messense messense force-pushed the new_string_to_fix_ub branch from 4893331 to 2e2f9b8 Compare June 2, 2026 10:44
@messense

messense commented Jun 2, 2026

Copy link
Copy Markdown
Owner

I verified the test_as_bytes_preserves_nul_bytes failure and the underlying claim looks correct.

PdfObject::as_bytes() uses mupdf_pdf_to_bytes, which wraps MuPDF’s raw pdf_to_string(ctx, obj, len) API. That API returns STRING(obj)->buf together with STRING(obj)->len, so the string contents are length-based raw PDF string bytes, not a NUL-free C string payload.

MuPDF also clearly allows PDF string bytes to contain 0x00:

  • pdf_new_string stores len bytes with memcpy(obj->buf, str, len), records obj->len, and only appends an extra terminator at obj->buf[len].
  • The literal-string lexer writes octal escapes directly, so (a\000b) produces b"a\0b".
  • The hex-string lexer writes arbitrary byte values, so <610062> also represents b"a\0b".

The new test demonstrates the regression: as_bytes() currently tries to convert b"a\0b" into CompactCBytes, which rejects interior NULs and returns InteriorNul { position: 1 }.

So I think as_bytes() should not return CompactCBytes. It should return a NUL-permitting owned byte type, e.g. Vec<u8>, or a separate compact byte type that does not maintain the CStr invariant. The failing test is a good regression test to keep and should pass once as_bytes() preserves arbitrary PDF string bytes.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Comment thread src/pdf/object.rs
Comment on lines +240 to 245
pub fn as_bytes(&self) -> Result<CompactCBytes, Error> {
let mut len = 0;
let ptr = unsafe { ffi_try!(mupdf_pdf_to_bytes(context(), self.inner, &mut len)) }?;
Ok(unsafe { slice::from_raw_parts(ptr, len) })
let bytes = unsafe { slice::from_raw_parts(ptr, len) };
CompactCBytes::try_from(bytes).map_err(Into::into)
}
Comment thread compact_cstr/Cargo.toml
Comment on lines +9 to +11
[dependencies]
ecow = "0.2"
memchr = "2"
@JustForFun88

Copy link
Copy Markdown
Contributor Author

@messense I will try to check next week. Unfortunately, currently I am very busy and don’t have time

@messense

Copy link
Copy Markdown
Owner

I've opened #245 without the optimizations to just fix the UB first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UB with PdfObject as_string, as_name, as_bytes methods

5 participants