5

I realise parquet is a column format, but with large files you sometimes don't want to read it all into memory in R before filtering, and the first 1000 or so rows may be enough for testing. I don't see an option for this in the read_parquet() documentation here.

I see a solution for pandas here, and an option for C# here, but it's not obvious to me how either might translate to R. Suggestions?

  • 1
    Looking through the docs, it seems like arrow gives lazy evaluation. So maybe you can dplyr::slice_head(n=1000) %>% compute()?
    – Dan Adams
    Commented Jul 27, 2022 at 2:19
  • Unfortunately arrow::read_parquet() does not appear to use lazy evaluation, based on my testing of the time and peak memory use of a) reading the whole file versus b) a piped implementation of slice() as you proposed; both deliver similar results.
    – Mark Neal
    Commented Jul 27, 2022 at 3:02
  • 2
    I think if you use arrow::open_dataset() that will index the parquet dataset and set it up for lazy evaluation. More here: arrow.apache.org/docs/r/articles/dataset.html
    – Jon Spring
    Commented Jul 27, 2022 at 3:04
  • 1
    @Jon is correct, arrow::open_dataset() appears to allow lazy evaluation. The lazy object is not compatible with slice(), but head() or filter() work. A good result - thanks!
    – Mark Neal
    Commented Jul 27, 2022 at 3:44

3 Answers

6

Thanks to Jon and Dan for pointing in the right direction.

arrow::open_dataset() allows lazy evaluation (docs here): you can call head() on the lazy object (but not slice()), or filter() it, before materialising the result. This approach is faster and uses much less peak RAM. Example below.

# https://stackoverflow.com/questions/73131505/r-reading-first-n-rows-from-parquet-file

library(dplyr)
library(arrow)
library(tictoc) # optional, used to time results

tic("read all of large parquet file")
my_animals <- read_parquet("data/my_animals.parquet")
toc() # slow and uses heaps of ram

tic("read parquet and write mini version")
my_animals <- open_dataset("data/my_animals.parquet") 
my_animals # this is a lazy object

my_animals %>% 
  # slice(1:1000) %>% # slice() doesn't work on the lazy Dataset
  head(n=1000L) %>% 
  # filter(YEAROFBIRTH >= 2010) %>% # also works
  compute() %>% 
  write_parquet("data/my_animals_mini.parquet") # optional
toc() # much faster, much less peak ram used
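As a quick follow-up check (a small sketch, assuming the optional write_parquet() step above was run and the source file has at least 1000 rows), you can read the mini file back to confirm its size:

my_animals_mini <- read_parquet("data/my_animals_mini.parquet")
nrow(my_animals_mini)  # 1000 (the head() taken above)
names(my_animals_mini) # same columns as the original file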

2

You can use the as_data_frame = FALSE argument of read_parquet() to return the data as an Arrow Table object. You can then use {dplyr} verbs on this object, followed by dplyr::collect() (collect() returns a tibble, whereas compute() merely forces the computation and leaves the result as an Arrow Table).

library(dplyr)
library(arrow)

my_animals <- read_parquet("data/my_animals.parquet", as_data_frame = FALSE) |>
  slice_head(n = 1000) |>
  collect()

This is readable, fast and memory efficient!
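For the record, a minimal sketch of the collect()/compute() difference described above, using the same data/my_animals.parquet file as the other answers (variable names are just for illustration):

library(dplyr)
library(arrow)

my_animals <- read_parquet("data/my_animals.parquet", as_data_frame = FALSE)

first_1000_arrow <- my_animals |>
  slice_head(n = 1000) |>
  compute()   # still an Arrow Table, held in Arrow memory

first_1000_tbl <- my_animals |>
  slice_head(n = 1000) |>
  collect()   # pulled into R as a tibble

class(first_1000_arrow) # "Table" "ArrowTabular" ...
class(first_1000_tbl)   # "tbl_df" "tbl" "data.frame"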

See https://arrow.apache.org/docs/r/articles/data_wrangling.html for more info.

0

I published this simple package for practical use: https://github.com/mkparkin/Rinvent. Feel free to check whether it can help. There is a parameter called "sample" which returns a sample of rows. It can also read "delta" files.

readparquetR(pathtoread = "C:/users/...", format = "delta", sample = 10)
# or
readparquetR(pathtoread = "C:/users/...", format = "parquet", sample = 10)
