Bug Description
Jinja2 {{ }} references in LLM column prompts fail when the referenced column is created by a PRE_BATCH processor. The compiler validates templates against the raw seed schema, which doesn't include columns added at runtime by processors.
Steps to Reproduce
- Create a workflow with a seed dataset that has columns [a, b]
- Add a PRE_BATCH processor that creates a new column c
- Define a downstream LLM column whose prompt references {{ c }}
- Run the workflow
Expected Behavior
The compiler should recognize that column c will exist after the PRE_BATCH processor runs, and allow {{ c }} in downstream prompts.
Actual Behavior
The compiler rejects the template because c doesn't exist in the raw seed data. The validation happens at compile time against the raw schema, before any processors run.
Root Cause
The compiler discovers seed columns from the raw seed reader and validates Jinja2 templates against that raw schema. PRE_BATCH processors can add columns at runtime, but the compiler has no way to know about them.
DropColumnsProcessor already implicitly declares removed columns via its config - the builder uses it to mark columns with drop=True at build time. But there's no equivalent mechanism for declaring added columns.
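The failure mode can be illustrated with a small standalone sketch. It does not use the project's compiler. `missing_references` and the regex are hypothetical stand-ins, and the regex only matches bare `{{ name }}` references, which is far less than real Jinja2 parsing handles:

```python
import re

# Simplified stand-in for Jinja2 variable discovery: matches only bare
# {{ name }} references, not filters, attribute access, or control blocks.
_REF = re.compile(r"\{\{\s*([A-Za-z_]\w*)\s*\}\}")


def missing_references(prompt: str, known_columns: set[str]) -> set[str]:
    """Return template references that are not in the known column set."""
    return set(_REF.findall(prompt)) - known_columns


raw_seed_columns = {"a", "b"}  # what the compiler discovers from the raw seed
prompt = "Summarize {{ a }} using {{ c }}"

missing = missing_references(prompt, raw_seed_columns)
print(missing)  # {'c'}
```

Because the check only sees the raw seed columns, `c` is reported as missing even though a PRE_BATCH processor would create it before the prompt is rendered.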
Proposed Fix
PRE_BATCH processors should declare which columns they add/remove so the compiler can compute the post-processor column set:
    class ProcessorConfig(ConfigBase):
        processor_type: str
        columns_added: list[str] = []
        columns_removed: list[str] = []

The compiler would adjust the column set after seed column discovery: remove declared drops, add SeedDatasetColumnConfig entries for declared additions. Template validation and DAG resolution would then see the final schema.
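A minimal sketch of that adjustment step, assuming processors are applied in order and that declared drops are applied before additions. `ProcessorDecl` and `effective_columns` are hypothetical names used only for illustration, not part of the existing codebase, and plain column names stand in for SeedDatasetColumnConfig entries:

```python
from dataclasses import dataclass, field


@dataclass
class ProcessorDecl:
    """Hypothetical stand-in for a processor config with declared schema changes."""

    stage: str  # e.g. "PRE_BATCH" or "POST_BATCH"
    columns_added: list[str] = field(default_factory=list)
    columns_removed: list[str] = field(default_factory=list)


def effective_columns(seed_columns: list[str], processors: list[ProcessorDecl]) -> list[str]:
    """Apply declared PRE_BATCH drops, then additions, in processor order."""
    columns = list(seed_columns)
    for proc in processors:
        if proc.stage != "PRE_BATCH":
            continue  # later stages do not change what generators can reference
        columns = [c for c in columns if c not in proc.columns_removed]
        columns += [c for c in proc.columns_added if c not in columns]
    return columns


processors = [
    ProcessorDecl(stage="PRE_BATCH", columns_added=["c"]),
    ProcessorDecl(stage="POST_BATCH", columns_added=["ignored"]),
]
print(effective_columns(["a", "b"], processors))  # ['a', 'b', 'c']
```

Template validation would then run against the returned list instead of the raw seed columns, so a prompt referencing `{{ c }}` would pass.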
This only applies to PRE_BATCH processors - POST_BATCH and AFTER_GENERATION processors don't need this since no downstream generators depend on their output schema.