Resize with `rf_lengthgets()` #9

gergness · 2025-06-13T02:26:45Z

Well, I was successfully nerd-sniped into seeing if ChatGPT could do it, but I didn't run a benchmark. I'll leave that up to you, @dtburk.

…GTH and SET_TRUELENGTH

dtburk · 2025-06-13T17:25:13Z

I also relied on ChatGPT for my PR, but I was ambivalent about how openly to proclaim that :)

Now to some benchmarks.

cps_00097 has 20,000 records and 14 variables.
usa_00128 has 20 million records and 31 variables.

100 iterations of reading cps_00097 with bench::mark

| | min | median |
|---|---|---|
| hipread from CRAN | 101ms | 104ms |
| dtburk's refactor | 121ms | 124ms |
| gergness' refactor | 118ms | 123ms |

1 iteration of reading usa_00128

| | elapsed |
|---|---|
| hipread from CRAN | 94.25s |
| dtburk's refactor | 111.41s |
| gergness' refactor | 109.38s |

More detailed output below, but it appears that in addition to being much simpler, the Rf_lengthgets refactor is ever-so-slightly faster than my more complicated refactor. Unfortunately, both are slightly slower than the current version of hipread, but probably not enough to bother users. I will merge this PR on Monday and proceed with a release to CRAN, unless you have any further comments, @gergness. Thanks for taking a look and offering a simpler solution!
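
To put a number on "slightly slower", here is a minimal sketch that computes each refactor's slowdown relative to the CRAN release. The timings are copied from the summary figures above; the helper `pct_slower()` and the vector names are made up for illustration and are not part of hipread or bench.

```r
# Timings copied from the summary figures in this comment.
# median_ms: median time reading cps_00097 (milliseconds).
# elapsed_s: single-run elapsed time reading usa_00128 (seconds).
median_ms <- c(cran = 104, dtburk = 124, gergness = 123)
elapsed_s <- c(cran = 94.25, dtburk = 111.41, gergness = 109.38)

# Hypothetical helper: percent slowdown of each entry relative to "cran",
# rounded to one decimal place.
pct_slower <- function(x) round(100 * (x / x[["cran"]] - 1), 1)

pct_slower(median_ms)  # cran 0.0, dtburk 19.2, gergness 18.3
pct_slower(elapsed_s)  # cran 0.0, dtburk 18.2, gergness 16.1
```

On these runs both refactors come out roughly 16 to 19% slower than CRAN, with the Rf_lengthgets version ahead of the other refactor by about one to two percentage points. The large-file figures come from a single iteration each, so they should be read as indicative only.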

hipread from CRAN

> bench::mark(
+   read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE),
+   iterations = 100
+ ) |> 
+   dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE)>
$ min        <bch:tm> 101ms
$ median     <bch:tm> 104ms
$ `itr/sec`  <dbl> 9.55614
$ mem_alloc  <bch:byt> 19.5MB
$ `gc/sec`   <dbl> 9.55614
$ n_itr      <int> 50
$ n_gc       <dbl> 50
$ total_time <bch:tm> 5.23s
$ result     <list> [<tbl_df[20351 x 14]>]
$ memory     <list> [<Rprofmem[7686 x 3]>]
$ time       <list> <112ms, 104ms, 109ms, 105ms, 110ms, 104ms, 109ms, 104ms, 110ms, 104ms…
$ gc         <list> [<tbl_df[100 x 3]>]

> system.time(
+   data <- read_ipums_micro("~/usa_00128.xml")
+ )
Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
|================================================================================| 100%  611 MB
   user  system elapsed 
  92.36    1.62   94.25

dtburk's ChatGPT-assisted refactor

> bench::mark(
+   read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE),
+   iterations = 100
+ ) |> 
+   dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <read_ipums_micro(ipums_example("cps_00097.xml"),…
$ min        <bch:tm> 121ms
$ median     <bch:tm> 124ms
$ `itr/sec`  <dbl> 8.017142
$ mem_alloc  <bch:byt> 21.7MB
$ `gc/sec`   <dbl> 44.30526
$ n_itr      <int> 19
$ n_gc       <dbl> 105
$ total_time <bch:tm> 2.37s
$ result     <list> [<tbl_df[20351 x 14]>]
$ memory     <list> [<Rprofmem[7799 x 3]>]
$ time       <list> <119ms, 114ms, 121ms, 119ms, 112ms, 114ms, 125ms, 1…
$ gc         <list> [<tbl_df[100 x 3]>]

> system.time(
+   data <- read_ipums_micro("~/usa_00128.xml")
+ )
Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
|============================================================| 100%  611 MB
   user  system elapsed 
 108.45    2.44  111.41

gergness' simpler ChatGPT-assisted refactor

> bench::mark(
+   read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE),
+   iterations = 100
+ ) |> 
+ dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE)>
$ min        <bch:tm> 118ms
$ median     <bch:tm> 123ms
$ `itr/sec`  <dbl> 8.133421
$ mem_alloc  <bch:byt> 21.7MB
$ `gc/sec`   <dbl> 47.44495
$ n_itr      <int> 18
$ n_gc       <dbl> 105
$ total_time <bch:tm> 2.21s
$ result     <list> [<tbl_df[20351 x 14]>]
$ memory     <list> [<Rprofmem[7843 x 3]>]
$ time       <list> <146ms, 124ms, 132ms, 132ms, 128ms, 132ms, 144ms, 133ms, 129ms, 127ms…
$ gc         <list> [<tbl_df[100 x 3]>]

> system.time(
+   data <- read_ipums_micro("~/usa_00128.xml")
+ )
Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
|================================================================================| 100%  611 MB
   user  system elapsed 
 106.92    2.20  109.38

gergness · 2025-06-13T18:29:41Z

Nope, nothing more to add, sounds good!

dtburk and others added 2 commits June 11, 2025 15:16

Refactor column resizing code to avoid using non-API functions SETLENGTH and SET_TRUELENGTH

f96c3c8

another take at resizing

db850e4

gergness force-pushed the chat-gpt-rf-lengthgets branch from 361dd1b to db850e4 June 13, 2025 02:32

dtburk merged commit ce1b2d4 into ipums:main Jun 16, 2025
5 checks passed

dtburk mentioned this pull request Jun 16, 2025

Non-API calls SETLENGTH and SET_TRUELENGTH #7

Closed
