
I have a parquet file and I want to read the first n rows from the file into a pandas data frame. What I tried:

df = pd.read_parquet(path='filepath', nrows=10)

It did not work and gave me error:

TypeError: read_table() got an unexpected keyword argument 'nrows'

I did try the skiprows argument as well, but that also gave me the same error.

Alternatively, I can read the complete parquet file and then keep only the first n rows, but that requires extra computation which I want to avoid.
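
For reference, this is the full-read approach I mean (a minimal sketch; 'filepath' is just a placeholder):

import pandas as pd

# Reads the entire parquet file into memory, then slices off the first 10 rows.
# This is the extra work I would like to avoid for large files.
df = pd.read_parquet('filepath').head(10)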

Is there any way to achieve it?


7 Answers


The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend-dependent.

To read using PyArrow as the backend, do the following:

from pyarrow.parquet import ParquetFile
import pyarrow as pa

pf = ParquetFile('file_name.pq')
# Pull just the first batch of 10 rows instead of reading the whole file
first_ten_rows = next(pf.iter_batches(batch_size=10))
df = pa.Table.from_batches([first_ten_rows]).to_pandas()

Change batch_size=10 to match however many rows you want to read in.
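
If you only need a few columns as well, iter_batches also accepts a columns argument. A sketch with placeholder column names ('col_a', 'col_b'):

from pyarrow.parquet import ParquetFile
import pyarrow as pa

pf = ParquetFile('file_name.pq')
# Only decode the listed columns while pulling the first 10 rows
first_batch = next(pf.iter_batches(batch_size=10, columns=['col_a', 'col_b']))
df = pa.Table.from_batches([first_batch]).to_pandas()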

  • And it is quite fast too (for 1m rows x 2k cols it takes 10 sec.)
    – mirekphd
    Commented Dec 26, 2021 at 10:35
  • Can we read random rows also?
    Commented Jan 18, 2022 at 23:40
  • And for S3....?
    – jtlz2
    Commented May 25, 2022 at 12:30
  • Can you also select certain rows to read @DavidKaftan? Example: from a very big parquet file, I'd like to read rows with index 13673, 14762 and 68712 only. How would you do this? (See the sketch after this comment list.)
    – Basj
    Commented Jun 24, 2022 at 10:52
  • You can do this and many other sql-like operations using S3 Select from the AWS SDK for pandas (formerly AWS data-wrangler). See answer below.
    Commented Oct 15, 2022 at 0:43
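
Regarding the comment about reading only specific row indices: one possible approach (a sketch, not a definitive answer) is the pyarrow dataset API's take method, which pulls rows by absolute position. Pyarrow may still scan the file internally, but only the requested rows end up in pandas:

import pyarrow.dataset as ds

# 'file_name.pq' is a placeholder path; the indices are the ones from the comment
dataset = ds.dataset('file_name.pq', format='parquet')
df = dataset.take([13673, 14762, 68712]).to_pandas()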

After exploring around and getting in touch with the pandas dev team, the conclusion is that pandas does not support the nrows or skiprows arguments when reading a parquet file.

The reason is that pandas uses the pyarrow or fastparquet engine to process parquet files, and pyarrow has no support for reading a file partially or for skipping rows (not sure about fastparquet). Below is the link to the pandas GitHub issue where this is discussed.

https://github.com/pandas-dev/pandas/issues/24511
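
A partial workaround at the pandas level is to limit columns rather than rows; read_parquet does support a columns argument. A sketch with placeholder column names:

import pandas as pd

# Reads all rows, but only the listed columns are loaded, which can cut
# I/O and memory substantially for wide files. Column names are placeholders.
df = pd.read_parquet('filepath', columns=['col_a', 'col_b'])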


Querying Parquet with DuckDB

To provide another perspective, if you're comfortable with SQL, you might consider using DuckDB for this. For example:

import duckdb
nrows = 10
file_path = 'path/to/data/parquet_file.parquet'
df = duckdb.query(f'SELECT * FROM "{file_path}" LIMIT {nrows};').df()

If you're working with partitioned parquet, the above result won't include any of the partition columns, since that information isn't stored in the lower-level files. Instead, you should identify the top folder as a partitioned parquet dataset and register it with a DuckDB connection:

import duckdb
import pyarrow.dataset as ds
nrows = 10
dataset = ds.dataset('path/to/data', 
                     format='parquet',
                     partitioning='hive')
con = duckdb.connect()
con.register('data_table_name', dataset)
df = con.execute(f"SELECT * FROM data_table_name LIMIT {nrows};").df()

You can register multiple datasets with the connection to enable more complex queries. I find DuckDB makes working with parquet files much more convenient, especially when trying to JOIN between multiple parquet datasets. Install it with conda install python-duckdb or pip install duckdb.
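
For example, a minimal sketch of registering two datasets and joining them (the paths, table names, and column names here are hypothetical):

import duckdb
import pyarrow.dataset as ds

# Hypothetical partitioned parquet datasets
orders = ds.dataset('path/to/orders', format='parquet', partitioning='hive')
customers = ds.dataset('path/to/customers', format='parquet', partitioning='hive')

con = duckdb.connect()
con.register('orders', orders)
con.register('customers', customers)

# DuckDB can push column selection and filters down to the parquet scan
df = con.execute("""
    SELECT o.order_id, c.customer_name
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    LIMIT 10
""").df()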


Using pyarrow dataset scanner:

import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
df = ds.dataset(src_path).scanner().head(n).to_pandas()
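
The scanner can also restrict columns and apply a row filter before anything is materialized. A sketch with placeholder column names ('col_a', 'value'):

import pyarrow.dataset as ds

n = 10
src_path = "/parquet/path"
# Only read two columns and keep rows where 'value' is positive
scanner = ds.dataset(src_path).scanner(
    columns=["col_a", "value"],
    filter=ds.field("value") > 0,
)
df = scanner.head(n).to_pandas()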

The most straightforward option for me seems to be using the dask library:

import dask.dataframe as dd
df = dd.read_parquet(path='filepath').head(10)
  • This isn't a great solution if your file is large, as this will read the entire file and then take the first 10 rows. Commented Oct 7 at 21:51

As an alternative, you can use the S3 Select functionality from the AWS SDK for pandas, as proposed by Abdel Jaidi in this answer.

pip install awswrangler

import awswrangler as wr

df = wr.s3.select_query(
        sql="SELECT * FROM s3object s limit 5",
        path="s3://filepath",
        input_serialization="Parquet",
        input_serialization_params={},
        use_threads=True,
)
  • the data may not be on s3
    – LudvigH
    Commented Jul 3, 2023 at 6:44
  • While this looks like a great approach, it unfortunately will not work for Parquet files with large block sizes. Commented Sep 21, 2023 at 17:09

Parquet is column-oriented storage, designed for that... so it's normal to load the whole file to access just one row.

  • Yes, parquet is column based. However, columns are divided into row groups. This means it is possible to only read a part of a parquet file (i.e. one row group). See parquet.apache.org/documentation/latest and arrow.apache.org/docs/python/… E.g. Apache Spark is able to read and process different row groups of the same parquet file on different machines in parallel. (A sketch of reading a single row group follows these comments.)
    – mrteutone
    Commented Nov 18, 2021 at 14:13
  • However, row groups are pretty large. In Spark/Hadoop, the default group size is 128/256 MB.
    – shay__
    Commented Apr 6, 2022 at 7:45
  • Saying that it is normal isn't very helpful when you get a 10GB file with a billion rows where just 1 million would be more than enough for your needs.
    – Alonzorz
    Commented May 13, 2022 at 13:41
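
Building on the row-group point above: pyarrow can read a single row group without touching the rest of the file. A minimal sketch (how many rows that covers depends on how the file was written):

from pyarrow.parquet import ParquetFile

pf = ParquetFile('file_name.pq')
print(pf.metadata.num_row_groups)  # how many row groups the file contains

# Read only the first row group; its row count depends on the writer's settings
df = pf.read_row_group(0).to_pandas()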
