Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208
Add owned PdfString/PdfVec types to fix UB with PdfObject as_string, as_name, as_bytes methods#208JustForFun88 wants to merge 2 commits into
PdfObject as_string, as_name, as_bytes methods#208Conversation
There was a problem hiding this comment.
Pull request overview
Introduces owned, NUL-terminated PdfString/PdfVec primitives (inline up to 23 bytes; heap via ecow::EcoVec) and updates PdfObject::{as_string, as_name, as_bytes} to return owned values to prevent dangling references when MuPDF resolves indirect objects.
Changes:
- Add
pdf::primitives::{PdfVec, PdfString}backed by a shared 24-byteReprwith inline + COW heap storage. - Update
PdfObject::{as_string, as_name, as_bytes}to return ownedPdfString/PdfVec, plus addTryAsCStrand broaden string-taking APIs (new_name,new_string, etc.). - Update link/destination parsing and add regression tests around indirect-object deletion.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/pdf/primitives/mod.rs |
New primitives module; exports PdfString/PdfVec and defines comparison macro + layout asserts. |
src/pdf/primitives/repr.rs |
Implements the shared 24-byte inline/heap representation and &CStr view. |
src/pdf/primitives/vector.rs |
Adds PdfVec owned byte container with as_slice/as_cstr and comparison impls. |
src/pdf/primitives/string.rs |
Adds PdfString wrapper around PdfVec with UTF-8 validation and string conveniences. |
src/pdf/primitives/tests.rs |
Unit tests for layout, inline/heap behavior, and PartialEq interoperability. |
src/pdf/object.rs |
Switches as_* accessors to owned returns; adds TryAsCStr, extends IntoPdfDictKey, adds name_eq. |
src/pdf/mod.rs |
Exposes primitives module and reexports PdfString/PdfVec + new traits. |
src/pdf/links/extraction.rs |
Adapts parsing to new owned return types and new PdfVec/PdfString conversions. |
src/pdf/document.rs |
Broadens APIs to accept TryAsCStr; adds indirect-delete regression tests. |
src/pdf/annotation.rs |
Updates set_author to accept TryAsCStr. |
src/destination.rs |
Adjusts name matching to use PdfVec::as_slice(). |
Cargo.toml |
Adds ecow dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
itsjunetime
left a comment
There was a problem hiding this comment.
I do appreciate the work that's gone into implementing this and ensuring it has all these very desirable properties. It's cool to see.
However, I don't think we should merge this as-is. There are too many potential vectors for unsoundness, which is not super great especially since we're trying to add this to avoid unsoundness.
Imo, structs like this are generally best off living in their own crate, where they are more easily auditable and analyzable. I think putting this sort of CStr representation in its own crate would also just be nice for the ecosystem - I doubt mupdf is the first crate to want a nicely-optimized small c string.
I also think that this would be better off using some sort of property testing (running under miri, ofc) to suss out other instances of potential unsoundness. A simple proptest that just generates nonsense strings and vecs of bytes, passes them through a few different functions that we've added here, and validates that they come out the other end looking exactly the same, would be a nice way to feel more confident about this.
92564ed to
4893331
Compare
Thanks, I really appreciate the feedback.
I’ve addressed this by tightening the invariants and removing the possibility of constructing these types incorrectly from arbitrary
Makes sense. I’ve moved the implementation into a separate crate to make it easier to test and reuse independently.
Done. I added property-based tests using proptest, along with runs under Miri, ASan, and Valgrind. Since Miri is quite slow, the number of proptest cases is reduced to 2 when running under it. |
4893331 to
2e2f9b8
Compare
|
I verified the
MuPDF also clearly allows PDF string bytes to contain
The new test demonstrates the regression: So I think |
| pub fn as_bytes(&self) -> Result<CompactCBytes, Error> { | ||
| let mut len = 0; | ||
| let ptr = unsafe { ffi_try!(mupdf_pdf_to_bytes(context(), self.inner, &mut len)) }?; | ||
| Ok(unsafe { slice::from_raw_parts(ptr, len) }) | ||
| let bytes = unsafe { slice::from_raw_parts(ptr, len) }; | ||
| CompactCBytes::try_from(bytes).map_err(Into::into) | ||
| } |
| [dependencies] | ||
| ecow = "0.2" | ||
| memchr = "2" |
|
@messense I will try to check next week. Unfortunately, currently I am very busy and don’t have time |
|
I've opened #245 without the optimizations to just fix the UB first. |
Closes #207. Return owned types instead of borrowed references:
as_string() -> &stras_string() -> PdfStringas_name() -> &[u8]as_name() -> PdfVecas_bytes() -> &[u8]as_bytes() -> PdfVecI decided to use custom
PdfString/PdfVectypes instead ofString/Vec<u8>because these custom structures allow the following:Type,Subtype,Font,Rect, ...) and most short string values.ecow::EcoVec<u8>which is a reference-counted, copy-on-write vector. Since we do not need to mutate the returned strings, I think that cheap cloning is preferable.I didn't use existing Rust crates, since most compact string crates (e.g.
compact_str,ecow::EcoString, etc.) are designed for Rust strings and cannot provide all of the following at the same time:&CStrconversion without allocation;PdfString/PdfVecare NUL-terminated, so.as_cstr()is just a pointer cast. Additionally, they support niche optimization, soOption<PdfString>/Option<PdfVec>have the same size as the structures themselves.I also added a helper
TryAsCStrtrait so functions likeget_dict,new_name, andnew_stringaccept not only&str, but also&CStrorPdfStringdirectly. This allows avoiding runtimeCStringallocations when the caller already has a NUL-terminated value (e.g. passingc"Subtype"instead of"Subtype").