Conversation

@dtburk (Collaborator) commented Jun 11, 2025

From CRAN checks:

Version: 0.2.4
Check: compiled code
Result: NOTE
  File ‘hipread/libs/hipread.so’:
    Found non-API calls to R: ‘SETLENGTH’, ‘SET_TRUELENGTH’
  
  Compiled code should not call non-API entry points in R.
  
  See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual,
  and section ‘Moving into C API compliance’ for issues with the use of
  non-API entry points.
Flavors: [r-devel-linux-x86_64-debian-clang](https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-clang/hipread-00check.html), [r-devel-linux-x86_64-debian-gcc](https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-debian-gcc/hipread-00check.html), [r-devel-linux-x86_64-fedora-clang](https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-fedora-clang/hipread-00check.html), [r-devel-linux-x86_64-fedora-gcc](https://www.r-project.org/nosvn/R.check/r-devel-linux-x86_64-fedora-gcc/hipread-00check.html), [r-patched-linux-x86_64](https://www.r-project.org/nosvn/R.check/r-patched-linux-x86_64/hipread-00check.html), [r-release-linux-x86_64](https://www.r-project.org/nosvn/R.check/r-release-linux-x86_64/hipread-00check.html), [r-release-macos-arm64](https://www.r-project.org/nosvn/R.check/r-release-macos-arm64/hipread-00check.html), [r-release-macos-x86_64](https://www.r-project.org/nosvn/R.check/r-release-macos-x86_64/hipread-00check.html)

Version: 0.2.4
Check: compiled code
Result: NOTE
  File 'hipread/libs/x64/hipread.dll':
    Found non-API calls to R: 'SETLENGTH', 'SET_TRUELENGTH'
  
  Compiled code should not call non-API entry points in R.
  
  See 'Writing portable packages' in the 'Writing R Extensions' manual,
  and section 'Moving into C API compliance' for issues with the use of
  non-API entry points.
Flavors: [r-devel-windows-x86_64](https://www.r-project.org/nosvn/R.check/r-devel-windows-x86_64/hipread-00check.html), [r-release-windows-x86_64](https://www.r-project.org/nosvn/R.check/r-release-windows-x86_64/hipread-00check.html)

The solution I've implemented here involves a fair bit of copying, unfortunately. However, in my testing it doesn't cause a drastic slowdown in file read time: as one example, a file with 19 million records and 31 variables that previously took 1.5 minutes to read now takes about 2 minutes.

@dtburk (Collaborator, Author) commented Jun 11, 2025

@gergness @mpadge if you have time to review these changes, I would greatly appreciate it!

@gergness (Contributor)

Bummer, I was very proud of the speed I was able to wring out of hipread, so I'm sad to lose some of it (but I can't imagine it will matter much)!

Your changes look fine to me, though this was definitely copied straight out of the vintage readr package, so I'm way out of my depth.

A few thoughts:

It may make sense to change the rescale factor code here:

if (i >= out[0]->size()) {
// Resize by guessing from the progress bar
resizeAllColumns(out, static_cast<int>((i / data->progress_info().first) * 1.1));
}

and here (hipread/src/read_full.cpp, lines 174 to 178 at f96c3c8):

// Resize by guessing from the progress bar
double resize_scale = 1.1 / data->progress_info().first;
// But make sure you at least 1.5x size because it's better to overallocate
resize_scale = std::max(1.5, resize_scale);
resizeAllColumns(out[rt_index], static_cast<int>(cur_pos_rt[rt_index] * resize_scale));

These rely on it being better to over-allocate than under-allocate, but I don't think that's true anymore. So rather than inflating the guess by 1.1x (with a minimum of 1.5x), it may be faster to just use the guess. Ideally you'd benchmark to make sure my intuition is right.

If I remember correctly, a non gzipped, rectangular file will have the guess be exactly correct, but it can be off for hierarchical files (because each record type's line lengths can be different, and the record types may not be uniformly distributed) and gzipped files (because the compression makes the file size per line non-exact).

Also worth noting that both data.table and vctrs have open issues on this, so it could be worth waiting for them to act first? However, I was able to find this data.table blog post, which, if I understand correctly, means they're planning a whole ALTREP implementation, which seems beyond the scope of this change.

Also, y'all really should get on a Parquet extract; it seems like such an obvious win for both R and Python users to me!

@gergness (Contributor)

Was curious how replaceable I am by ChatGPT, and it had two suggestions that could be worth pursuing. I don't know how much time you're able to dedicate to this, and I've done no due diligence, but it seems like this is a common enough problem that you shouldn't always have to copy the whole vector.

  1. Rf_lengthgets()

     SEXP new_vec = Rf_lengthgets(old_vec, new_size);

     This is a semi-supported function in the R API. It reallocates the vector to a new size, copying over old elements and zero-filling or truncating as needed. However:

     • It only works on certain vector types (REALSXP, INTSXP, STRSXP, etc.).
     • The resulting vector may be reallocated (i.e., it may not modify in place).
     • It may not preserve attributes, depending on context.

     ✅ Safer than SETLENGTH, and often used in base R code, but still not ideal for CRAN, especially when combined with internal manipulations.
  2. Rf_allocVector3() (R >= 4.4.0)

     This is a newer API that allows allocation of a vector with pre-filled memory:

     SEXP vec = Rf_allocVector3(TYPE, length, allocator);  // third argument is an R_allocator_t*, or NULL for the default

     • Only available in R 4.4.0+
     • Allows more efficient, zero-initialized vector allocation (NA-filled where appropriate).
     • Safer and more modern, but not suitable for older R versions unless you do version checks.

@dtburk (Collaborator, Author) commented Jun 16, 2025

Closing this in favor of an alternative PR from @gergness.

@dtburk closed this Jun 16, 2025