Conversation

@gergness
Contributor

Well, I was successfully nerd-sniped into seeing if ChatGPT could do it, but I didn't run a benchmark. I'll leave that up to you, @dtburk.

@gergness force-pushed the chat-gpt-rf-lengthgets branch from 361dd1b to db850e4 on June 13, 2025 02:32
@dtburk
Collaborator

dtburk commented Jun 13, 2025

I also relied on ChatGPT for my PR, but I was ambivalent about how openly to proclaim that :)

Now to some benchmarks.

  • cps_00097 has 20,000 records and 14 variables.
  • usa_00128 has 20 million records and 31 variables.

100 iterations of reading cps_00097 with bench::mark

                      min     median
hipread from CRAN     101ms   104ms
dtburk's refactor     121ms   124ms
gergness' refactor    118ms   123ms

1 iteration of reading usa_00128

                      elapsed
hipread from CRAN     94.25s
dtburk's refactor     111.41s
gergness' refactor    109.38s
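
As a rough sketch of how the timings above compare, the slowdown relative to the CRAN release can be computed in base R. The `timings` data frame and the `pct_slower` helper are illustrative names only, not part of hipread or ipumsr, and the inputs are just the single-run figures reported above.

```r
# Timings copied from the two summaries above.
timings <- data.frame(
  version     = c("hipread from CRAN", "dtburk's refactor", "gergness' refactor"),
  cps_median  = c(104, 124, 123),          # milliseconds, cps_00097
  usa_elapsed = c(94.25, 111.41, 109.38)   # seconds, usa_00128
)

# Hypothetical helper: percent slowdown relative to the first element
# (the CRAN release), rounded to one decimal place.
pct_slower <- function(x) round(100 * (x / x[1] - 1), 1)

timings$cps_pct_slower <- pct_slower(timings$cps_median)
timings$usa_pct_slower <- pct_slower(timings$usa_elapsed)

print(timings)
```

Since the usa_00128 figures come from one iteration each, the percentages are indicative at best.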

More detailed output below, but it appears that in addition to being much simpler, the Rf_lengthgets refactor is ever-so-slightly faster than my more complicated refactor. Unfortunately both are slower than the current CRAN version of hipread (roughly 16 to 19 percent in these runs), but probably not enough to bother users. I will merge this PR on Monday and proceed with a release to CRAN, unless you have any further comments, @gergness. Thanks for taking a look and offering a simpler solution!

hipread from CRAN

> bench::mark(
+   read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE),
+   iterations = 100
+ ) |> 
+   dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE)>
$ min        <bch:tm> 101ms
$ median     <bch:tm> 104ms
$ `itr/sec`  <dbl> 9.55614
$ mem_alloc  <bch:byt> 19.5MB
$ `gc/sec`   <dbl> 9.55614
$ n_itr      <int> 50
$ n_gc       <dbl> 50
$ total_time <bch:tm> 5.23s
$ result     <list> [<tbl_df[20351 x 14]>]
$ memory     <list> [<Rprofmem[7686 x 3]>]
$ time       <list> <112ms, 104ms, 109ms, 105ms, 110ms, 104ms, 109ms, 104ms, 110ms, 104ms…
$ gc         <list> [<tbl_df[100 x 3]>]

> system.time(
+   data <- read_ipums_micro("~/usa_00128.xml")
+ )
Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
|================================================================================| 100%  611 MB
   user  system elapsed 
  92.36    1.62   94.25

dtburk's ChatGPT-assisted refactor

> bench::mark(
+   read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE),
+   iterations = 100
+ ) |> 
+   dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <read_ipums_micro(ipums_example("cps_00097.xml"),…
$ min        <bch:tm> 121ms
$ median     <bch:tm> 124ms
$ `itr/sec`  <dbl> 8.017142
$ mem_alloc  <bch:byt> 21.7MB
$ `gc/sec`   <dbl> 44.30526
$ n_itr      <int> 19
$ n_gc       <dbl> 105
$ total_time <bch:tm> 2.37s
$ result     <list> [<tbl_df[20351 x 14]>]
$ memory     <list> [<Rprofmem[7799 x 3]>]
$ time       <list> <119ms, 114ms, 121ms, 119ms, 112ms, 114ms, 125ms, 1…
$ gc         <list> [<tbl_df[100 x 3]>]

> system.time(
+   data <- read_ipums_micro("~/usa_00128.xml")
+ )
Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
|============================================================| 100%  611 MB
   user  system elapsed 
 108.45    2.44  111.41

gergness' simpler ChatGPT-assisted refactor

> bench::mark(
+   read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE),
+   iterations = 100
+ ) |> 
+   dplyr::glimpse()
Rows: 1
Columns: 13
$ expression <bch:expr> <read_ipums_micro(ipums_example("cps_00097.xml"), verbose = FALSE)>
$ min        <bch:tm> 118ms
$ median     <bch:tm> 123ms
$ `itr/sec`  <dbl> 8.133421
$ mem_alloc  <bch:byt> 21.7MB
$ `gc/sec`   <dbl> 47.44495
$ n_itr      <int> 18
$ n_gc       <dbl> 105
$ total_time <bch:tm> 2.21s
$ result     <list> [<tbl_df[20351 x 14]>]
$ memory     <list> [<Rprofmem[7843 x 3]>]
$ time       <list> <146ms, 124ms, 132ms, 132ms, 128ms, 132ms, 144ms, 133ms, 129ms, 127ms…
$ gc         <list> [<tbl_df[100 x 3]>]

> system.time(
+   data <- read_ipums_micro("~/usa_00128.xml")
+ )
Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
|================================================================================| 100%  611 MB
   user  system elapsed 
 106.92    2.20  109.38

@gergness
Contributor Author

Nope, nothing more to add, sounds good!

@dtburk merged commit ce1b2d4 into ipums:main on Jun 16, 2025
5 checks passed