[FEATURE] Reduce boilerplate for adding new PPL commands and functions #4960

@dai-chen

Is your feature request related to a problem?

Adding new PPL commands or functions requires extensive boilerplate code across multiple layers of the codebase, making it error-prone and time-consuming. Developers must manually:

  1. Update ANTLR grammar files (lexer and parser)
  2. Create AST node classes and implement AST/RelNode visitor pattern methods
  3. Register functions in multiple locations (PPLBuiltinOperators, PPLFuncImpTable, BuiltinFunctionName)
  4. Create unit tests that validate the translation pipeline and the generated Spark SQL
  5. Create multiple integration test classes (IT, Yaml IT, ExplainIT, AnonymousIT, CrossClusterIT)
  6. Remember engine-specific setup (e.g., enableCalcite() for V3-only functions)
  7. Update documentation following the latest doc structure

This repetitive process leads to:

  • Human error: It is easy to miss required steps or files; some are subtle even when you reference similar PRs.
  • Inconsistency: Different developers may structure code differently.
  • Slow onboarding: New contributors face a steep learning curve.
  • Maintenance burden: Changes to architecture require updating many files.

What solution would you like?

Below are several ideas that address the problem from different levels (tooling, specs, and architecture):

  1. Interactive scaffolding tool: Add a Gradle task that generates the required files and code insertions from a few prompts, with a strict “expected files changed” checklist to prevent missed steps (human or AI) and copy-paste drift.

  2. Spec-driven code generation: Define a YAML/Markdown spec as the single source of truth for commands/functions (metadata, signatures, engine support, tests/docs) and generate the repetitive glue at build time. A structured spec is also AI-friendly by design.

  3. Better abstractions: Introduce a small DSL for AST → RelNode translation and shared test base classes/mixins to standardize setup. This reduces long-term maintenance and makes both scaffolding and codegen simpler and safer.
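
To make ideas 1 and 2 a bit more concrete, here is a minimal sketch of how a single spec entry could drive an "expected files changed" checklist. Everything in it is hypothetical: FunctionSpec, expectedArtifacts, and missing are illustrative names, not existing classes in the codebase, and the checklist strings only paraphrase the manual steps listed above. A real version would presumably load the spec from YAML/Markdown and run as a Gradle task; this sketch hard-codes both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types exist in the project today.
final class FunctionSpecChecklist {

    // Stand-in for one spec entry; the proposal would load this from YAML/Markdown.
    static final class FunctionSpec {
        final String name;
        final boolean calciteOnly;

        FunctionSpec(String name, boolean calciteOnly) {
            this.name = name;
            this.calciteOnly = calciteOnly;
        }
    }

    // Derives the checklist from the spec, mirroring the manual steps listed in this issue.
    static List<String> expectedArtifacts(FunctionSpec spec) {
        List<String> items = new ArrayList<>();
        items.add("grammar: lexer and parser rules for " + spec.name);
        items.add("registration: PPLBuiltinOperators");
        items.add("registration: PPLFuncImpTable");
        items.add("registration: BuiltinFunctionName");
        items.add("tests: unit test");
        items.add(spec.calciteOnly
            ? "tests: integration test with enableCalcite()"
            : "tests: integration test");
        items.add("docs: function reference");
        return items;
    }

    // Returns the checklist items not yet marked as done, in checklist order.
    static List<String> missing(FunctionSpec spec, Set<String> done) {
        List<String> result = new ArrayList<>();
        for (String item : expectedArtifacts(spec)) {
            if (!done.contains(item)) {
                result.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example: a V3-only function with two steps already completed.
        FunctionSpec spec = new FunctionSpec("my_func", true);
        Set<String> done = Set.of("registration: PPLBuiltinOperators", "docs: function reference");
        for (String item : missing(spec, done)) {
            System.out.println("TODO " + item);
        }
    }
}
```

The point of deriving the checklist from the spec is that engine-specific steps (such as the enableCalcite() setup) show up automatically instead of relying on the contributor to remember them.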

What alternatives have you considered?

  • Better documentation only: Documenting the process better helps, but it doesn't prevent human error
  • Copy-paste from examples: Current approach, leads to inconsistencies and forgotten steps
  • Code review checklists: Helpful but reactive - catches errors after they're made

Do you have any additional context?

Labels: maintenance (Improves code quality, but not the product)
