Describe the bug
When merging files with the ParquetWriter, there seems to be an inconsistency in how the dataframe index (event_no) is handled.
I think this is because the ParquetWriter writes the event_no column and sets it as the index of the dataframe with pandas. Later, when merge_files() reads the writer outputs, polars expects event_no to be a normal column. Simply not setting it as the index in the writer may be enough to solve the issue, but I'm not sure whether that would have downstream effects.
To Reproduce
This minimal script should reproduce the error; run it from within the main graphnet folder:
from pathlib import Path

from graphnet.data.dataconverter import DataConverter
from graphnet.data.extractors.prometheus import (
    PrometheusFeatureExtractor,
    PrometheusTruthExtractor,
)
from graphnet.data.readers import PrometheusReader
from graphnet.data.writers import ParquetWriter

root = Path("./data/tests/prometheus")
outdir = root / "out"
if not outdir.exists():
    outdir.mkdir()

converter = DataConverter(
    file_reader=PrometheusReader(),
    save_method=ParquetWriter(truth_table="mc_truth"),
    extractors=[PrometheusTruthExtractor(), PrometheusFeatureExtractor()],
    outdir=str(outdir),
    num_workers=1,
)
converter(input_dir=str(root))

# fails here
converter.merge_files()
Expected behavior
The final graphnet-ready output is produced without error.
Full traceback
graphnet [MainProcess] INFO 2026-03-13 14:49:02 - PrometheusReader.__init__ - Writing log to logs
graphnet [MainProcess] INFO 2026-03-13 14:49:03 - DataConverter.<module> - Merging files to .../out/merged
Traceback (most recent call last):
File "./graphnet/./reproduce_parquet_index_column_bug.py", line 28, in <module>
converter.merge_files()
File "./graphnet/src/graphnet/data/dataconverter.py", line 389, in merge_files
self._save_method.merge_files(
File "./graphnet/src/graphnet/data/writers/parquet_writer.py", line 100, in merge_files
truth_meta = self._identify_events(
^^^^^^^^^^^^^^^^^^^^^^
File "./graphnet/src/graphnet/data/writers/parquet_writer.py", line 152, in _identify_events
df.select([index_column]),
^^^^^^^^^^^^^^^^^^^^^^^^^
File "./graphnet/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10307, in select
.collect(optimizations=QueryOptFlags._eager())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "./graphnet/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "./graphnet/.venv/lib/python3.11/site-packages/polars/lazyframe/opt_flags.py", line 326, in wrapper
return function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "./graphnet/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2440, in collect
return wrap_df(ldf.collect(engine, callback))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: unable to find column "event_no"; valid columns: ["interaction", "initial_state_energy", "initial_state_type", "initial_state_zenith", "initial_state_azimuth", "initial_state_x", "initial_state_y", "initial_state_z"]