Jinja2 templates cannot reference columns created by PRE_BATCH processors #394

@andreatgretel

Bug Description

Jinja2 {{ }} references in LLM column prompts fail when the referenced column is created by a PRE_BATCH processor. The compiler validates templates against the raw seed schema, which doesn't include columns added at runtime by processors.

Steps to Reproduce

  1. Create a workflow with a seed dataset that has columns [a, b]
  2. Add a PRE_BATCH processor that creates a new column c
  3. Define a downstream LLM column whose prompt references {{ c }}
  4. Run the workflow

Expected Behavior

The compiler should recognize that column c will exist after the PRE_BATCH processor runs, and allow {{ c }} in downstream prompts.

Actual Behavior

The compiler rejects the template because c doesn't exist in the raw seed data. The validation happens at compile time against the raw schema, before any processors run.

Root Cause

The compiler discovers seed columns from the raw seed reader and validates Jinja2 templates against that raw schema. PRE_BATCH processors can add columns at runtime, but the compiler has no way to know about them.
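The failure mode can be reproduced in isolation with a minimal sketch. It is not the project's compiler code: `find_missing_references` is a hypothetical helper, and the regular expression is a simplified stand-in for real Jinja2 variable discovery that only matches bare `{{ name }}` references.

```python
import re

# Simplified stand-in for Jinja2 parsing (assumption): matches bare {{ name }} only,
# not filters, attribute access, or expressions.
_REFERENCE = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def find_missing_references(template: str, known_columns: set[str]) -> set[str]:
    """Return referenced column names that are absent from known_columns."""
    return set(_REFERENCE.findall(template)) - known_columns


# The schema the compiler sees at compile time: raw seed columns only.
raw_seed_columns = {"a", "b"}

# Column c would be created at runtime by a PRE_BATCH processor.
prompt = "Summarize {{ a }} using {{ c }}"

missing = find_missing_references(prompt, raw_seed_columns)
```

Here `missing` contains `c`, because validation only ever sees the raw seed schema. The template is rejected even though the column would exist by the time the prompt is rendered.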

DropColumnsProcessor already declares removed columns implicitly through its config: the builder uses that config to mark columns with drop=True at build time. There is no equivalent mechanism for declaring added columns.

Proposed Fix

PRE_BATCH processors should declare which columns they add/remove so the compiler can compute the post-processor column set:

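class ProcessorConfig(ConfigBase):
    processor_type: str
    # Columns this processor adds, visible to downstream templates.
    columns_added: list[str] = []
    # Columns this processor removes before downstream columns run.
    columns_removed: list[str] = []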

The compiler would adjust the column set after seed column discovery: remove declared drops, add SeedDatasetColumnConfig entries for declared additions. Template validation and DAG resolution would then see the final schema.
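One way the adjustment could look is sketched below, under stated assumptions. `ProcessorDeclaration` and `effective_columns` are hypothetical names; a plain dataclass stands in for the proposed `ProcessorConfig` fields, and column names stand in for `SeedDatasetColumnConfig` entries. The sketch also assumes processors are applied in order, with each one's removals handled before its additions.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorDeclaration:
    """Hypothetical stand-in for the proposed columns_added/columns_removed fields."""
    columns_added: list[str] = field(default_factory=list)
    columns_removed: list[str] = field(default_factory=list)


def effective_columns(
    seed_columns: list[str],
    processors: list[ProcessorDeclaration],
) -> list[str]:
    """Compute the column set visible after all PRE_BATCH processors have run."""
    columns = list(seed_columns)
    for proc in processors:
        # Drop declared removals first.
        removed = set(proc.columns_removed)
        columns = [name for name in columns if name not in removed]
        # Then append declared additions, skipping names already present.
        for name in proc.columns_added:
            if name not in columns:
                columns.append(name)
    return columns


schema = effective_columns(
    ["a", "b"],
    [ProcessorDeclaration(columns_added=["c"], columns_removed=["b"])],
)
```

With seed columns `[a, b]` and one processor that removes `b` and adds `c`, `schema` is `["a", "c"]`. Validating `{{ c }}` against this list instead of the raw seed schema would let the template through, while a reference to the dropped `b` would still be flagged.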

This applies only to PRE_BATCH processors. POST_BATCH and AFTER_GENERATION processors don't need it, since no downstream generators depend on their output schema.

Labels: bug
