Querying Parquet with DuckDB
To provide another perspective, if you're comfortable with SQL, you might consider using DuckDB for this. For example:
import duckdb
nrows = 10
file_path = 'path/to/data/parquet_file.parquet'
df = duckdb.query(f'SELECT * FROM "{file_path}" LIMIT {nrows};').df()
If you're working with partitioned Parquet, the result above won't include any of the partition columns, since that information isn't stored in the lower-level files. Instead, point pyarrow at the top-level folder of the partitioned Parquet dataset and register the resulting dataset with a DuckDB connection:
import duckdb
import pyarrow.dataset as ds
nrows = 10
dataset = ds.dataset('path/to/data',
                     format='parquet',
                     partitioning='hive')
con = duckdb.connect()
con.register('data_table_name', dataset)
df = con.execute(f"SELECT * FROM data_table_name LIMIT {nrows};").df()
You can register multiple datasets with the same connection to enable more complex queries. I find DuckDB makes working with Parquet files much more convenient, especially when JOINing across multiple Parquet datasets. Install it with conda install python-duckdb or pip install duckdb.