
ParquetWriter fails when merging files #871

@astrojarred

Description


Describe the bug
When merging files with the ParquetWriter, there seems to be an inconsistency in how the dataframe index (event_no) is handled.

I think this is because the ParquetWriter writes the event_no column and sets it as the index of the dataframe with pandas. Later, when merge_files() reads the writer's outputs, polars expects event_no to be a normal column. Simply not setting it as the index in the writer may be enough to solve the issue, but I'm not sure whether that would have downstream effects.

To Reproduce
This minimal code should reproduce the error, run it from within the main graphnet folder:

from pathlib import Path
from graphnet.data.dataconverter import DataConverter
from graphnet.data.extractors.prometheus import (
    PrometheusFeatureExtractor,
    PrometheusTruthExtractor,
)
from graphnet.data.readers import PrometheusReader
from graphnet.data.writers import ParquetWriter


root = Path("./data/tests/prometheus")

outdir = root / "out"
if not outdir.exists():
    outdir.mkdir()

converter = DataConverter(
    file_reader=PrometheusReader(),
    save_method=ParquetWriter(truth_table="mc_truth"),
    extractors=[PrometheusTruthExtractor(), PrometheusFeatureExtractor()],
    outdir=str(outdir),
    num_workers=1,
)
converter(input_dir=str(root))

# fails here
converter.merge_files()

Expected behavior
The final graphnet-ready output is produced without error.

Full traceback

graphnet [MainProcess] INFO     2026-03-13 14:49:02 - PrometheusReader.__init__ - Writing log to logs
graphnet [MainProcess] INFO     2026-03-13 14:49:03 - DataConverter.<module> - Merging files to .../out/merged
Traceback (most recent call last):
  File "./graphnet/./reproduce_parquet_index_column_bug.py", line 28, in <module>
    converter.merge_files()
  File "./graphnet/src/graphnet/data/dataconverter.py", line 389, in merge_files
    self._save_method.merge_files(
  File "./graphnet/src/graphnet/data/writers/parquet_writer.py", line 100, in merge_files
    truth_meta = self._identify_events(
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "./graphnet/src/graphnet/data/writers/parquet_writer.py", line 152, in _identify_events
    df.select([index_column]),
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./graphnet/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10307, in select
    .collect(optimizations=QueryOptFlags._eager())
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./graphnet/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py", line 97, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./graphnet/.venv/lib/python3.11/site-packages/polars/lazyframe/opt_flags.py", line 326, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./graphnet/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2440, in collect
    return wrap_df(ldf.collect(engine, callback))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: unable to find column "event_no"; valid columns: ["interaction", "initial_state_energy", "initial_state_type", "initial_state_zenith", "initial_state_azimuth", "initial_state_x", "initial_state_y", "initial_state_z"]

Metadata

Labels: bug (Something isn't working)