Thanks to Jon and Dan for pointing me in the right direction. arrow::open_dataset() allows lazy evaluation (see the arrow docs), from which you can then take head() (but not slice()) or filter(). This approach is faster and uses much less peak RAM. Example below.
# https://stackoverflow.com/questions/73131505/r-reading-first-n-rows-from-parquet-file
library(dplyr)
library(arrow)
library(tictoc) # optional, used to time results

tic("read all of large parquet file")
my_animals <- read_parquet("data/my_animals.parquet")
toc() # slow and uses heaps of RAM

tic("read parquet and write mini version")
my_animals <- open_dataset("data/my_animals.parquet")
my_animals # this is a lazy object
my_animals %>%
  # slice(1000L) %>%                # doesn't work on the lazy object
  head(n = 1000L) %>%
  # filter(YEAROFBIRTH >= 2010) %>% # also works
  compute() %>%
  write_parquet("data/my_animals_mini.parquet") # optional
toc() # much faster, much less peak RAM used
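If you only need particular rows rather than the first n, the commented-out filter() line above can be run against the same lazy object. A minimal sketch, assuming the same hypothetical YEAROFBIRTH column used in the example; collect() pulls the matching rows into R as a regular tibble, whereas compute() would keep the result as an Arrow Table.

library(dplyr)
library(arrow)

my_animals <- open_dataset("data/my_animals.parquet") # lazy, nothing read into R yet
recent <- my_animals %>%
  filter(YEAROFBIRTH >= 2010) %>% # filter is pushed down to the scan
  collect()                       # only the matching rows come back as a tibble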
dplyr::slice_head(n=1000) %>% compute()

arrow::read_parquet() does not appear to use lazy evaluation, based on my testing of the time and max memory use to a) read all of the file, versus b) a piped implementation of slice() as you proposed; both deliver similar results.

arrow::open_dataset() will index the parquet dataset and set it up for lazy evaluation. More here: arrow.apache.org/docs/r/articles/dataset.html

arrow::open_dataset() appears to allow lazy evaluation. The lazy object is not compatible with slice(), but head() or filter() works. A good result - thanks!
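For completeness, the slice_head() call mentioned in the comments could be written against the lazy object along these lines. This is only a sketch: whether slice_head() is supported on an Arrow query depends on your arrow version, so if it errors, head(n = 1000L) as used in the answer is the portable choice (the output path here is just illustrative).

library(dplyr)
library(arrow)

my_animals <- open_dataset("data/my_animals.parquet")
my_animals %>%
  slice_head(n = 1000) %>% # may not work on older arrow versions; head(n = 1000L) does
  compute() %>%            # materialise as an Arrow Table
  write_parquet("data/my_animals_mini_alt.parquet") # hypothetical output file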