
[FEATURE] Add Union command in PPL #5240

Open
srikanthpadakanti wants to merge 1 commit into opensearch-project:main from srikanthpadakanti:feature/union-public

Conversation

@srikanthpadakanti
Contributor

Description

Adds a union command to PPL that implements SQL-style UNION ALL semantics with Calcite-based type coercion. The command combines multiple datasets (indices, patterns, aliases, or subsearches) with automatic schema merging: missing fields are filled with NULL, compatible types are coerced to a common supertype (e.g., int + float --> float), and incompatible types fall back to string. It works both as a first command and mid-pipeline, where the upstream result set is implicitly included as the first dataset.

Related Issues

Resolves #5110

Check List

  • [x] New functionality includes testing.
  • [x] New functionality has been documented.
  • [x] New functionality has javadoc added.
  • [x] New functionality has a user manual doc added.
  • [x] New PPL command checklist all confirmed.
  • [ ] API changes companion pull request created.
  • [x] Commits are signed per the DCO using --signoff or -s.
  • [ ] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 17, 2026

PR Reviewer Guide 🔍

(Review updated until commit 9ec4314)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: AST, Parser, and Anonymizer support for Union command

Relevant files:

  • core/src/main/java/org/opensearch/sql/ast/tree/Union.java
  • core/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java
  • core/src/main/java/org/opensearch/sql/analysis/Analyzer.java
  • ppl/src/main/antlr/OpenSearchPPLLexer.g4
  • ppl/src/main/antlr/OpenSearchPPLParser.g4
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java
  • ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java
  • ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java
  • integ-test/src/test/java/org/opensearch/sql/ppl/NewAddedCommandsIT.java

Sub-PR theme: Calcite execution engine and integration tests for Union command

Relevant files:

  • core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java
  • core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLUnionTest.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteUnionCommandIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/CalciteNoPushdownIT.java
  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
  • integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java
  • integ-test/src/test/resources/expectedOutput/calcite/explain_union.yaml
  • integ-test/src/test/resources/expectedOutput/calcite_no_pushdown/explain_union.yaml

Sub-PR theme: Documentation for Union command

Relevant files:

  • docs/user/ppl/cmd/union.md
  • docs/user/ppl/index.md
  • docs/category.json

⚡ Recommended focus areas for review

Field Name Collision

In buildUnifiedSchemaForUnion, the schema is built by iterating over all inputs and adding fields only if not already seen (by name). However, the type stored in seenFields is from the first input that introduces the field, not the coerced common type. If coerceUnionTypes has already aligned types, this should be fine, but if a field appears in only one input (missing in others), the NULL literal type is derived from the first-seen type. This could cause subtle issues if the coercion step and the schema-building step are not perfectly aligned.

private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
  List<SchemaField> schema = new ArrayList<>();
  Map<String, RelDataType> seenFields = new HashMap<>();

  for (RelNode node : nodes) {
    for (RelDataTypeField field : node.getRowType().getFieldList()) {
      if (!seenFields.containsKey(field.getName())) {
        schema.add(new SchemaField(field.getName(), field.getType()));
        seenFields.put(field.getName(), field.getType());
      }
    }
  }
  return schema;
}
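The first-seen-type behavior described above can be sketched with plain maps standing in for Calcite's RelDataType (an assumption for brevity); the sketch shows why the unified schema must only be built from inputs that have already been coerced:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnifiedSchemaDemo {
    // First-seen-type unification, mirroring buildUnifiedSchemaForUnion;
    // plain type-name strings stand in for RelDataType (an assumption).
    static Map<String, String> unify(List<Map<String, String>> inputs) {
        Map<String, String> schema = new LinkedHashMap<>();
        for (Map<String, String> input : inputs) {
            for (Map.Entry<String, String> field : input.entrySet()) {
                // Only the first input to introduce a field decides its type
                schema.putIfAbsent(field.getKey(), field.getValue());
            }
        }
        return schema;
    }

    public static void main(String[] args) {
        Map<String, String> first = new LinkedHashMap<>();
        first.put("id", "INTEGER");
        Map<String, String> second = new LinkedHashMap<>();
        second.put("id", "BIGINT");
        second.put("name", "VARCHAR");

        // id keeps INTEGER (first seen) even though the second input says BIGINT,
        // which is only safe if coercion has already unified the two types
        System.out.println(unify(List.of(first, second)));
        // prints {id=INTEGER, name=VARCHAR}
    }
}
```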
FLOAT/DECIMAL Rank

In getNumericTypeRankForUnion, DECIMAL has rank 5 while FLOAT has rank 7 and DOUBLE has rank 8. This means INTEGER+FLOAT coerces to FLOAT (correct), but DECIMAL+FLOAT coerces to FLOAT, and DECIMAL+DOUBLE coerces to DOUBLE. DECIMAL is a fixed-precision type and widening it to FLOAT/DOUBLE can cause precision loss. This ranking may produce unexpected results for users combining DECIMAL fields with floating-point fields.

private static int getNumericTypeRankForUnion(SqlTypeName typeName) {
  return switch (typeName) {
    case TINYINT -> 1;
    case SMALLINT -> 2;
    case INTEGER -> 3;
    case BIGINT -> 4;
    case DECIMAL -> 5;
    case REAL -> 6;
    case FLOAT -> 7;
    case DOUBLE -> 8;
    default -> 0;
  };
}
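The precision concern is concrete: under this ranking DECIMAL+FLOAT widens to FLOAT, and a fixed-precision value can silently lose digits in the process. A minimal standalone sketch (the value is chosen for illustration):

```java
import java.math.BigDecimal;

public class DecimalWideningDemo {
    public static void main(String[] args) {
        // A DECIMAL value with more significant digits than a float can hold
        BigDecimal d = new BigDecimal("123456.789012345");

        float asFloat = d.floatValue();    // DECIMAL+FLOAT -> FLOAT per the rank table
        double asDouble = d.doubleValue(); // DECIMAL+DOUBLE -> DOUBLE

        System.out.println(asFloat);   // digits beyond float precision are lost
        System.out.println(asDouble);  // retains all 15 significant digits here
        System.out.println((double) asFloat == asDouble); // false: the widened values differ
    }
}
```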
Single Dataset Validation

The check if (inputNodes.size() < 2) in visitUnion throws an IllegalArgumentException at planning time. However, when union is used mid-pipeline, the upstream node is attached via Union.attach() which prepends it to the datasets list. The validation in visitUnionCommand in AstBuilder allows 1+ datasets (for mid-pipeline use). It should be verified that the attach mechanism always results in at least 2 nodes reaching visitUnion, otherwise the error message may be confusing since it says "Provided: 1" but the user specified a valid mid-pipeline union with one explicit dataset.

if (inputNodes.size() < 2) {
  throw new IllegalArgumentException(
      "Union command requires at least two datasets. Provided: " + inputNodes.size());
}
Regex Keyword List

The anonymizeSubsearchCommand method builds a dynamic regex keyword exclusion list that includes the commandName parameter. For union, this adds union to the exclusion list, but the list is constructed via string concatenation without escaping. If a future command name contains regex special characters, this could break the regex. Additionally, the keyword list is rebuilt on every call; consider extracting it as a constant or computing it once.

private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
  String keywords =
      "source|fields|where|stats|head|tail|sort|eval|rename|"
          + commandName
          + "|search|table|identifier|\\*\\*\\*";
  List<String> anonymizedSubsearches = new ArrayList<>();

  for (UnresolvedPlan subsearch : subsearches) {
    String anonymizedSubsearch = anonymizeData(subsearch);
    anonymizedSubsearch = "search " + anonymizedSubsearch;
    anonymizedSubsearch =
        anonymizedSubsearch
            .replaceAll("\\bsource=\\w+", "source=table")
            .replaceAll("\\b(?!" + keywords + ")\\w+(?=\\s*[<>=!])", "identifier")
            .replaceAll("\\b(?!" + keywords + ")\\w+(?=\\s*,)", "identifier")
            .replaceAll("fields \\+\\s*\\b(?!" + keywords + ")\\w+", "fields + identifier")
            .replaceAll(
                "fields \\+\\s*identifier,\\s*\\b(?!" + keywords + ")\\w+",
                "fields + identifier,identifier");
    anonymizedSubsearches.add(anonymizedSubsearch);
  }
  // ... (remaining logic joins anonymizedSubsearches into the returned string)
}
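The escaping risk is easy to reproduce in isolation. The sketch below uses a hypothetical command name containing a regex metacharacter to show how java.util.regex.Pattern.quote would make the concatenation safe:

```java
import java.util.regex.Pattern;

public class RegexEscapeDemo {
    public static void main(String[] args) {
        String name = "multi+search"; // hypothetical command name with a regex metacharacter

        // Raw concatenation: '+' is parsed as a quantifier, so the literal
        // name no longer matches its own alternation branch
        String unsafe = "union|" + name;
        // Pattern.quote wraps the name in \Q...\E so every character matches literally
        String safe = "union|" + Pattern.quote(name);

        System.out.println(name.matches(unsafe)); // false
        System.out.println(name.matches(safe));   // true
    }
}
```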

@github-actions
Contributor

github-actions bot commented Mar 17, 2026

PR Code Suggestions ✨

Latest suggestions up to 9ec4314

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Include all command names in anonymizer keyword exclusion list

The keywords regex is built by concatenating commandName directly into a regex
alternation. If commandName contains regex special characters (unlikely for current
values like "union" or "multisearch", but possible for future commands), this could
break the regex. Additionally, the keyword list does not include union when
anonymizing multisearch commands and vice versa, which could cause command names to
be incorrectly replaced with identifier. Both command names should always be in the
exclusion list.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [790-794]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
+          + "union|multisearch"
           + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5


Why: This is a valid bug — when anonymizing a union command, the keyword multisearch is not in the exclusion list (and vice versa), which could cause command names appearing in subsearch content to be incorrectly replaced with identifier. The fix is straightforward and correct, hardcoding both command names instead of dynamically inserting only the current one.

Low
Validate dataset count before processing inputs

When union is used mid-pipeline (e.g., search source=X | union [...]), the upstream
plan is attached via attach() and becomes part of node.getDatasets(). However, if
only one explicit dataset is provided in the grammar (which is allowed by the
parser), inputNodes will have size 1 after the loop, and the error is thrown after
already consuming the relBuilder state. The size check should happen before
processing datasets, using the total expected count (datasets + possible implicit
upstream), or the grammar should enforce at least 2 datasets at parse time.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2472-2481]

+// Validate before processing to avoid partial relBuilder state corruption
+// When used mid-pipeline, attach() prepends the upstream, so datasets.size() >= 1 is valid
+// but we still need at least 2 total inputs for a union
+if (node.getDatasets().size() < 2) {
+  throw new IllegalArgumentException(
+      "Union command requires at least two datasets. Provided: " + node.getDatasets().size());
+}
+
 for (UnresolvedPlan dataset : node.getDatasets()) {
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
   prunedDataset.accept(this, context);
   inputNodes.add(context.relBuilder.build());
 }
 
-if (inputNodes.size() < 2) {
-  throw new IllegalArgumentException(
-      "Union command requires at least two datasets. Provided: " + inputNodes.size());
-}
-
Suggestion importance[1-10]: 4


Why: Moving the validation before processing avoids partial relBuilder state corruption, which is a valid concern. However, the mid-pipeline case (where attach() prepends the upstream) means node.getDatasets() already contains all datasets including the implicit upstream, so the check at inputNodes.size() < 2 after the loop is functionally equivalent. The suggestion has merit but the impact is limited.

Low
Remove redundant constructor annotation

The class has both @RequiredArgsConstructor and @AllArgsConstructor. Since datasets
is final and maxout is not, @RequiredArgsConstructor generates a constructor with
only datasets, while @AllArgsConstructor generates one with both. However, the
attach method calls new Union(newDatasets, maxout) which uses the all-args
constructor. If maxout is null (the common case), the single-arg constructor new
Union(datasets) would be used elsewhere, but having both annotations is redundant
and can cause confusion. Consider removing @RequiredArgsConstructor since
@AllArgsConstructor covers all cases, and use new Union(newDatasets, maxout)
consistently.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
 
   private Integer maxout;
Suggestion importance[1-10]: 3


Why: Having both @RequiredArgsConstructor and @AllArgsConstructor is indeed redundant since @RequiredArgsConstructor generates a constructor with only datasets (the final field) while @AllArgsConstructor covers both fields. Removing @RequiredArgsConstructor simplifies the code, but this is a minor style improvement with low functional impact.

Low
Possible issue
Unified schema should reflect widest coerced type

The unified schema uses the type from the first node that introduces a field, but
after coerceUnionTypes runs, subsequent nodes may have had their types widened to a
common supertype. The schema builder should use the coerced (target) type for each
field rather than the first-seen type, otherwise buildProjectionForUnion may emit
unnecessary CASTs or miss type mismatches. Consider deriving the unified schema from
the coerced target type map instead of the first-seen field type.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
-  Map<String, RelDataType> seenFields = new HashMap<>();
+  Map<String, RelDataType> seenFields = new LinkedHashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        seenFields.put(name, field.getType());
+        schema.add(new SchemaField(name, field.getType()));
+      } else {
+        // Update to the widest type seen (post-coercion nodes may have wider types)
+        RelDataType existing = seenFields.get(name);
+        if (!existing.equals(field.getType())) {
+          // Replace with the current node's type if it is wider (coercion already applied)
+          seenFields.put(name, field.getType());
+          schema.set(schema.indexOf(new SchemaField(name, existing)), new SchemaField(name, field.getType()));
+        }
       }
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 4


Why: The concern is valid in theory — after coerceUnionTypes, all nodes should have the same type for shared fields, so the first-seen type should already be the coerced type. The suggested fix using indexOf with a new SchemaField object is fragile and may not work correctly if equals isn't defined properly. The issue is real but the proposed solution has its own problems.

Low

Previous suggestions

Suggestions up to commit 3782336
Category | Suggestion | Impact
Possible issue
Fix relBuilder stack state after maxout limit

After applying LogicalSystemLimit, the node is pushed onto the builder stack but
peek() is called without a prior build() on the non-maxout path. When maxout is
null, context.relBuilder.peek() returns the union node correctly, but when maxout is
not null, the LogicalSystemLimit is pushed but never built/returned consistently.
The method should always return context.relBuilder.peek() after ensuring the final
node is on the stack, which it does — but the push after build() inside the maxout
branch leaves the builder in an inconsistent state for subsequent operations since
the built node is pushed back.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2486-2499]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 6


Why: The suggestion correctly identifies that calling context.relBuilder.build() inside LogicalSystemLimit.create() consumes the union node from the stack, and then push puts the limit node back. The improved code explicitly separates build() from the push, making the stack management clearer and less error-prone for future maintenance.

Low
Ensure unified schema uses coerced types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, subsequent nodes may have had their types
changed. The schema should reflect the coerced (common) type, not just the
first-seen type. This can cause type mismatches when projecting fields from later
nodes, leading to incorrect CASTs or runtime errors.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
+      // After coercion, all nodes sharing a field name should have the same type;
+      // the first-seen type is already the common type post-coercion.
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3


Why: The suggestion claims that buildUnifiedSchemaForUnion uses pre-coercion types, but since coerceUnionTypes is called first and returns new RelNode objects with updated types, the first-seen type in the coerced inputs is already the common type. The improved code is functionally identical to the existing code with only a comment added, making this a low-impact suggestion.

Low
General
Clarify constructor design for immutable fields

Using both @RequiredArgsConstructor and @AllArgsConstructor together with a mix of
final and non-final fields creates two constructors: one with only datasets and one
with both datasets and maxout. However, @RequiredArgsConstructor generates a
constructor for final fields only, so maxout (non-final) is excluded. This is
intentional but fragile — if maxout is ever made final, the behavior changes
silently. Consider making maxout final and using a single @AllArgsConstructor, or
document the intent clearly.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 3


Why: The suggestion to make maxout final and use @AllArgsConstructor with an explicit no-maxout constructor improves immutability and clarity. However, this is a minor design preference and the existing dual-annotation approach with @RequiredArgsConstructor and @AllArgsConstructor is a common Lombok pattern that works correctly.

Low
Handle qualified dataset name parsing correctly

When parsing a qualifiedDatasetName, getText() returns the raw concatenated text
including any COLON and DOT separators (e.g., cluster:index.alias becomes
cluster:index.alias as a single string). This is passed directly to qualifiedName,
which may not correctly handle cross-cluster or dotted index names. The name should
be split on the appropriate delimiters to construct a proper qualified name with
separate parts.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java [1355-1362]

 for (OpenSearchPPLParser.UnionDatasetContext datasetCtx : ctx.unionDataset()) {
   if (datasetCtx.subSearch() != null) {
     datasets.add(visitSubSearch(datasetCtx.subSearch()));
   } else if (datasetCtx.qualifiedDatasetName() != null) {
+    // Use the full text as a single qualified name part to preserve
+    // index patterns (e.g., logs-*, cluster:index) correctly
     String datasetName = datasetCtx.qualifiedDatasetName().getText();
     datasets.add(new Relation(Collections.singletonList(AstDSL.qualifiedName(datasetName))));
   }
 }
Suggestion importance[1-10]: 1


Why: The improved code is functionally identical to the existing code — it keeps the same getText() call and qualifiedName usage, only adding a comment. The suggestion raises a valid concern about cross-cluster name parsing but doesn't actually fix anything, making it a score of 1.

Low
Suggestions up to commit f05fbdb
Category | Suggestion | Impact
Possible issue
Fix maxout limit node not pushed onto builder stack

After calling context.relBuilder.build() inside the if (node.getMaxout() != null)
block, the result is passed to LogicalSystemLimit.create but never pushed back onto
the builder stack. The subsequent context.relBuilder.peek() will then return the
union node rather than the limit node, causing the maxout limit to be silently
dropped from the plan.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2467-2480]

 for (RelNode input : unifiedInputs) {
   context.relBuilder.push(input);
 }
 context.relBuilder.union(true, unifiedInputs.size()); // true = UNION ALL
 
 if (node.getMaxout() != null) {
+  RelNode unionNode = context.relBuilder.build();
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
-          context.relBuilder.build(),
+          unionNode,
           context.relBuilder.literal(node.getMaxout())));
 }
 
 return context.relBuilder.peek();
Suggestion importance[1-10]: 8


Why: This is a real bug: context.relBuilder.build() pops the union node from the stack and passes it to LogicalSystemLimit.create, but the resulting limit node is pushed via context.relBuilder.push(...) correctly. Looking more carefully at the code, the push call does push the limit node, so peek() would return it. The suggestion's premise may be incorrect — the existing code does push the limit node. Score lowered accordingly, but the suggestion highlights a subtle flow worth verifying.

Medium
Fix unified schema using post-coercion types

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, subsequent nodes may have the same field
with a different (coerced) type. The schema should use the coerced/target type for
each field, not just the first-seen type. This can cause type mismatches in the
subsequent buildProjectionForUnion cast logic.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
+      } else {
+        // Update schema entry if a wider type is found (post-coercion nodes may differ)
+        RelDataType existing = seenFields.get(name);
+        if (!existing.equals(field.getType())) {
+          // Replace with the current type if it differs (coercion should have unified already)
+          seenFields.put(name, field.getType());
+          schema.replaceAll(sf -> sf.getName().equals(name) ? new SchemaField(name, field.getType()) : sf);
+        }
       }
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 4


Why: The concern is valid in theory — buildUnifiedSchemaForUnion uses the first-seen type, which may differ from the coerced type in other nodes. However, since coerceUnionTypes is called first and should unify types across all inputs before unifySchemasForUnion runs, the first-seen type after coercion should already be the common type. The suggested fix adds complexity and may not address a real bug in practice.

Low
General
Include all command names in anonymizer keyword exclusion list

The keywords regex negative lookahead is built by concatenating commandName directly
into the regex pattern. If commandName contains regex special characters, this could
break the pattern. Additionally, when visitMultisearch calls this method, the
keyword multisearch is included, but when visitUnion calls it, only union is added —
neither includes the other command name, so one command's name could be incorrectly
anonymized as identifier in the other's output.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [790-794]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
+          + "multisearch|union"
           + "|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that when anonymizing a union subsearch, the keyword multisearch is not in the exclusion list (and vice versa), which could cause command names to be incorrectly replaced with identifier. Hardcoding all command names is a reasonable fix, though it requires maintenance as new commands are added.

Low
Make maxout field immutable and constructors explicit

Using both @RequiredArgsConstructor and @AllArgsConstructor together with a mix of
final and non-final fields creates two constructors: one with only datasets (from
@RequiredArgsConstructor) and one with both fields (from @AllArgsConstructor).
However, maxout is non-final and mutable, which can lead to inconsistent state.
Consider making maxout final and removing @RequiredArgsConstructor, or using a
single explicit constructor to avoid ambiguity.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 4


Why: The suggestion to make maxout final and use explicit constructors improves immutability and clarity. The dual @RequiredArgsConstructor/@AllArgsConstructor pattern with mixed final/non-final fields is a valid design concern, though it works correctly in practice for this use case.

Low
Suggestions up to commit 60613ec
Category | Suggestion | Impact
Possible issue
Fix inconsistent RelBuilder state after maxout branch

After the if (node.getMaxout() != null) block, if maxout is set, LogicalSystemLimit
is pushed directly onto the builder but then context.relBuilder.peek() is called
without a prior build(). If maxout is null, context.relBuilder.build() was already
called inside the limit block, leaving the builder in an inconsistent state. The
final return context.relBuilder.peek() should be consistent regardless of the maxout
branch.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2472-2480]

 if (node.getMaxout() != null) {
   context.relBuilder.push(
       LogicalSystemLimit.create(
           LogicalSystemLimit.SystemLimitType.SUBSEARCH_MAXOUT,
           context.relBuilder.build(),
           context.relBuilder.literal(node.getMaxout())));
 }
 
-return context.relBuilder.peek();
+return context.relBuilder.build();
Suggestion importance[1-10]: 7


Why: This is a valid concern about RelBuilder state consistency. When maxout is set, context.relBuilder.build() is called inside the limit block, and then LogicalSystemLimit is pushed back. Calling peek() vs build() at the end could lead to inconsistent behavior depending on how the caller uses the returned node. Changing to build() would be more consistent with the pattern used elsewhere in the codebase.

Medium
Fix unified schema type resolution after coercion

The unified schema is built using only the type from the first node that introduces
a field, but after coerceUnionTypes runs, subsequent nodes may have the same field
with a different (coerced) type. The schema should use the coerced target type from
targetTypeMap rather than the first-seen type, otherwise the projection cast in
buildProjectionForUnion may cast to the wrong type. Consider passing the target type
map or re-deriving the common type here.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        // After coercion, all nodes sharing this field should have the same type;
+        // use the type from the first occurrence (post-coercion).
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
       }
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3


Why: The suggestion claims there's a bug where the schema uses the first-seen type rather than the coerced type. However, buildUnifiedSchemaForUnion is called after coerceUnionTypes, so the input nodes already have coerced types. The first occurrence of a field in the coerced nodes will already have the target type, making this a non-issue in practice. The improved_code is also essentially identical to the existing_code.

Low
General
Fix temporal type widening to preserve timezone information

The temporal type widening logic returns TIMESTAMP when either type is TIMESTAMP,
even if the other is TIMESTAMP_WITH_LOCAL_TIME_ZONE. This means
TIMESTAMP_WITH_LOCAL_TIME_ZONE would be silently downcast to TIMESTAMP, potentially
losing timezone information. TIMESTAMP_WITH_LOCAL_TIME_ZONE should be considered
wider than plain TIMESTAMP.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [378-387]

 private static SqlTypeName getWiderTemporalTypeForUnion(SqlTypeName type1, SqlTypeName type2) {
-  if (type1 == SqlTypeName.TIMESTAMP || type2 == SqlTypeName.TIMESTAMP) {
-    return SqlTypeName.TIMESTAMP;
-  }
   if (type1 == SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE
       || type2 == SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE) {
     return SqlTypeName.TIMESTAMP_WITH_LOCAL_TIME_ZONE;
   }
+  if (type1 == SqlTypeName.TIMESTAMP || type2 == SqlTypeName.TIMESTAMP) {
+    return SqlTypeName.TIMESTAMP;
+  }
   return SqlTypeName.DATE;
 }
Suggestion importance[1-10]: 6


Why: This is a valid correctness issue: the current logic returns TIMESTAMP when one type is TIMESTAMP and the other is TIMESTAMP_WITH_LOCAL_TIME_ZONE, silently losing timezone information. Reordering the checks so TIMESTAMP_WITH_LOCAL_TIME_ZONE takes priority is the correct fix for proper type widening semantics.

Low
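The reordered widening logic can be sketched in isolation. `TemporalType` below is a stand-in for Calcite's `SqlTypeName` (an assumption made so the sketch has no Calcite dependency); the point is only the check ordering, widest type first:

```java
// Sketch of the reordered temporal widening from the suggestion above.
// TemporalType stands in for org.apache.calcite.sql.type.SqlTypeName.
class TemporalWidening {
  enum TemporalType { DATE, TIMESTAMP, TIMESTAMP_WITH_LOCAL_TIME_ZONE }

  // Checks ordered from widest to narrowest, so TIMESTAMP_WITH_LOCAL_TIME_ZONE
  // wins over plain TIMESTAMP and timezone information is preserved.
  static TemporalType wider(TemporalType t1, TemporalType t2) {
    if (t1 == TemporalType.TIMESTAMP_WITH_LOCAL_TIME_ZONE
        || t2 == TemporalType.TIMESTAMP_WITH_LOCAL_TIME_ZONE) {
      return TemporalType.TIMESTAMP_WITH_LOCAL_TIME_ZONE;
    }
    if (t1 == TemporalType.TIMESTAMP || t2 == TemporalType.TIMESTAMP) {
      return TemporalType.TIMESTAMP;
    }
    return TemporalType.DATE;
  }

  public static void main(String[] args) {
    // A mixed TIMESTAMP / TIMESTAMP_WITH_LOCAL_TIME_ZONE pair now widens
    // to the timezone-aware type instead of silently dropping the zone.
    System.out.println(wider(TemporalType.TIMESTAMP,
        TemporalType.TIMESTAMP_WITH_LOCAL_TIME_ZONE));
  }
}
```

With the original ordering, the same pair would have returned `TIMESTAMP`, which is the information loss the suggestion flags.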
Make maxout field immutable to prevent inconsistent state

Having both @RequiredArgsConstructor and @AllArgsConstructor with a final field and
a non-final field creates two constructors: one with only datasets and one with both
datasets and maxout. However, maxout is non-final and mutable, which can lead to
inconsistent state. Consider making maxout final and using only @AllArgsConstructor,
or use a builder pattern to avoid accidental mutation.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
-
Suggestion importance[1-10]: 4


Why: Making maxout final is a reasonable immutability improvement for an AST node. However, the current design with @RequiredArgsConstructor and @AllArgsConstructor is intentional to allow creating a Union with or without maxout, and the attach method already creates a new Union instance. The risk of accidental mutation is low in practice.

Low
Suggestions up to commit 0ff52aa
Category | Suggestion | Impact
General
Parse qualified dataset names from individual tokens

When a qualifiedDatasetName contains a colon (e.g., cluster:index.alias), getText()
returns the raw concatenated text including the colon and dots without any parsing.
This raw string is passed directly as a qualified name, which may not be correctly
interpreted downstream. The grammar rule ident (COLON ident (DOT ident)*)? should be
parsed into its components (cluster, index, alias) rather than using the raw text.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java [1355-1362]

 for (OpenSearchPPLParser.UnionDatasetContext datasetCtx : ctx.unionDataset()) {
   if (datasetCtx.subSearch() != null) {
     datasets.add(visitSubSearch(datasetCtx.subSearch()));
   } else if (datasetCtx.qualifiedDatasetName() != null) {
-    String datasetName = datasetCtx.qualifiedDatasetName().getText();
-    datasets.add(new Relation(Collections.singletonList(AstDSL.qualifiedName(datasetName))));
+    OpenSearchPPLParser.QualifiedDatasetNameContext nameCtx = datasetCtx.qualifiedDatasetName();
+    // Build qualified name from individual ident tokens to handle cluster:index.alias correctly
+    List<String> parts = nameCtx.ident().stream()
+        .map(RuleContext::getText)
+        .collect(Collectors.toList());
+    datasets.add(new Relation(Collections.singletonList(
+        AstDSL.qualifiedName(parts.toArray(new String[0])))));
   }
 }
Suggestion importance[1-10]: 5


Why: The concern about getText() returning raw concatenated text for cluster:index.alias patterns is valid. Parsing individual ident tokens would be more robust for cross-cluster references, though it's unclear if the downstream qualifiedName handling already accounts for this format.

Low
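The splitting behavior the suggestion asks for can be illustrated without the ANTLR parse tree. This sketch splits the raw text directly (the grammar rule assumed is `ident (COLON ident (DOT ident)*)?`; in the real fix the parts would come from the `ident` tokens rather than string splitting):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: split "cluster:index.alias" into components instead of passing
// the raw concatenated text downstream as a single qualified name.
class QualifiedNameSketch {
  static List<String> parts(String raw) {
    List<String> out = new ArrayList<>();
    int colon = raw.indexOf(':');
    if (colon >= 0) {
      out.add(raw.substring(0, colon)); // cluster component
      out.addAll(Arrays.asList(raw.substring(colon + 1).split("\\.")));
    } else {
      out.addAll(Arrays.asList(raw.split("\\.")));
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(parts("cluster:index.alias")); // [cluster, index, alias]
    System.out.println(parts("logs"));                // [logs]
  }
}
```

Whether the cluster prefix should stay fused to the index name for cross-cluster search is a downstream question; the sketch only shows the component extraction the suggestion describes.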
Guard against relBuilder stack corruption per dataset

Each dataset's RelNode is built by calling context.relBuilder.build(), which pops
from the shared relBuilder stack. If a dataset's sub-plan pushes multiple nodes onto
the builder (e.g., due to nested unions or joins), only the top node is captured,
potentially leaving stale nodes on the stack and corrupting subsequent operations.
Consider saving and restoring the builder state, or verifying the stack depth before
and after each dataset visit.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2453-2457]

 for (UnresolvedPlan dataset : node.getDatasets()) {
+  int stackSizeBefore = context.relBuilder.size();
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
   prunedDataset.accept(this, context);
+  if (context.relBuilder.size() != stackSizeBefore + 1) {
+    throw new IllegalStateException(
+        "Expected exactly one RelNode pushed for union dataset, but stack changed by "
+            + (context.relBuilder.size() - stackSizeBefore));
+  }
   inputNodes.add(context.relBuilder.build());
 }
Suggestion importance[1-10]: 4


Why: The concern about relBuilder stack corruption is valid in theory, but the improved_code adds a runtime assertion rather than actually fixing the issue. This is a defensive check that could help catch bugs but doesn't address the root cause, and RelBuilder.size() may not be a public API.

Low
Remove redundant constructor annotation on Union class

Both @RequiredArgsConstructor and @AllArgsConstructor are present.
@RequiredArgsConstructor generates a constructor with only datasets (the final
field), while @AllArgsConstructor generates one with both datasets and maxout.
However, maxout is not final, so @RequiredArgsConstructor will not include it. This
means the single-argument constructor sets maxout to null implicitly, which is fine,
but having both annotations is redundant and confusing. Remove
@RequiredArgsConstructor since @AllArgsConstructor covers all cases, and add a
dedicated constructor for the datasets-only case if needed.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
 
   private Integer maxout;
 
+  public Union(List<UnresolvedPlan> datasets) {
+    this.datasets = datasets;
+    this.maxout = null;
+  }
+
Suggestion importance[1-10]: 4


Why: Having both @RequiredArgsConstructor and @AllArgsConstructor is indeed redundant and potentially confusing. The suggestion is valid and improves code clarity, though the impact is minor since both annotations work correctly together in Lombok.

Low
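The constructor situation is easier to see desugared. This sketch shows plain-Java equivalents of what the proposed change would generate, with `maxout` made final; `UnresolvedPlan` is stubbed as an empty interface here (the real type lives in `org.opensearch.sql.ast.tree`):

```java
import java.util.List;

// Desugared view of the proposed Union constructors: one all-args
// constructor plus an explicit datasets-only constructor delegating to it,
// replacing the @RequiredArgsConstructor/@AllArgsConstructor pair.
class UnionCtorSketch {
  interface UnresolvedPlan {}

  static class Union {
    private final List<UnresolvedPlan> datasets;
    private final Integer maxout; // final, so no post-construction mutation

    Union(List<UnresolvedPlan> datasets, Integer maxout) { // @AllArgsConstructor
      this.datasets = datasets;
      this.maxout = maxout;
    }

    Union(List<UnresolvedPlan> datasets) { // explicit datasets-only ctor
      this(datasets, null);
    }

    Integer getMaxout() { return maxout; }
  }

  public static void main(String[] args) {
    Union u = new Union(List.of());
    System.out.println(u.getMaxout()); // null when maxout is omitted
  }
}
```

Delegating through `this(datasets, null)` keeps a single initialization path, which is the main readability win over the two independent Lombok-generated constructors.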
Possible issue
Fix unified schema type resolution after coercion

The unified schema is built using the type from the first node that introduces a
field, but after coerceUnionTypes runs, subsequent nodes may have the same field
with a different (coerced) type. The schema should use the coerced target type from
targetTypeMap rather than the first-seen type, otherwise the projection in
buildProjectionForUnion will always detect a mismatch and add unnecessary CASTs.
Consider passing the target type map or re-deriving the common type here.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
   Map<String, RelDataType> seenFields = new HashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        schema.add(new SchemaField(name, field.getType()));
+        seenFields.put(name, field.getType());
+      } else if (!seenFields.get(name).equals(field.getType())) {
+        // After coercion, all inputs sharing a field name should have the same type.
+        // If they still differ, prefer the wider type already recorded.
       }
     }
   }
   return schema;
 }
Suggestion importance[1-10]: 3


Why: The suggestion points out that buildUnifiedSchemaForUnion uses the first-seen type, but after coerceUnionTypes all inputs sharing a field should already have the same coerced type. The improved_code doesn't actually fix the described problem (it adds a comment but no real logic change), making the suggestion's impact minimal and the fix incomplete.

Low
Suggestions up to commit 12b2eb5
Category | Suggestion | Impact
General
Fix mutable field inconsistency in AST node

The class has both @RequiredArgsConstructor and @AllArgsConstructor.
@RequiredArgsConstructor generates a constructor for final fields only (i.e.,
datasets), while @AllArgsConstructor generates one for all fields (datasets and
maxout). This means there are two constructors: Union(List) and Union(List,
Integer). However, maxout is not final, so it can be mutated after construction,
which is inconsistent with the immutable design of other AST nodes. Consider making
maxout final and removing @RequiredArgsConstructor to avoid ambiguity, or use a
single constructor pattern.

core/src/main/java/org/opensearch/sql/ast/tree/Union.java [18-27]

 @Getter
 @ToString
 @EqualsAndHashCode(callSuper = false)
-@RequiredArgsConstructor
 @AllArgsConstructor
 public class Union extends UnresolvedPlan {
   private final List<UnresolvedPlan> datasets;
+  private final Integer maxout;
 
-  private Integer maxout;
+  public Union(List<UnresolvedPlan> datasets) {
+    this(datasets, null);
+  }
Suggestion importance[1-10]: 5


Why: Making maxout final and consolidating constructors improves immutability consistency with other AST nodes. The suggestion is valid and the improved_code correctly reflects the change, though the impact is moderate since the field is only set at construction time in practice.

Low
Use complete keyword list in anonymizer regex

The keywords regex negative lookahead is built by concatenating commandName directly
into the regex pattern. If commandName contains regex special characters, this could
break the pattern or cause incorrect anonymization. While current command names
(multisearch, union) are safe, this is fragile. Additionally, the keywords string
does not include union when called for multisearch and vice versa — meaning field
names that match the other command name could be incorrectly anonymized. Consider
using a fixed comprehensive keyword list instead of dynamically inserting the
command name.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [790-794]

 private String anonymizeSubsearchCommand(String commandName, List<UnresolvedPlan> subsearches) {
   String keywords =
       "source|fields|where|stats|head|tail|sort|eval|rename|"
-          + commandName
-          + "|search|table|identifier|\\*\\*\\*";
+          + "multisearch|union|search|table|identifier|\\*\\*\\*";
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that the dynamic keyword list excludes the other command name (e.g., union is missing when anonymizing multisearch and vice versa), which could cause incorrect anonymization of field names matching those keywords. The fix is straightforward and the improved_code accurately reflects the change.

Low
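The fixed-keyword-list idea can be demonstrated with a small stand-alone anonymizer. The keyword set and the `identifier` placeholder below are illustrative assumptions, not the plugin's exact anonymizer behavior; the point is that a static alternation inside a negative lookahead avoids concatenating `commandName` into the pattern and covers both `multisearch` and `union` at once:

```java
import java.util.regex.Pattern;

// Sketch: replace identifiers with a placeholder unless they match a
// fixed, comprehensive PPL keyword list (negative lookahead).
class AnonymizerSketch {
  private static final String KEYWORDS =
      "source|fields|where|stats|head|tail|sort|eval|rename|"
          + "multisearch|union|search|table|identifier";

  static String anonymize(String fragment) {
    // \b(?!(?:KEYWORDS)\b)word  -- skip keywords, replace everything else.
    Pattern p =
        Pattern.compile("\\b(?!(?:" + KEYWORDS + ")\\b)[a-zA-Z_][a-zA-Z0-9_]*");
    return p.matcher(fragment).replaceAll("identifier");
  }

  public static void main(String[] args) {
    System.out.println(anonymize("source=logs | union [ source=metrics ]"));
    // source=identifier | union [ source=identifier ]
  }
}
```

Because the list is static, a field literally named `union` is treated as a keyword in either command's output, which is exactly the consistency the suggestion is after.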
Validate builder stack consistency per dataset

Each dataset is processed by calling prunedDataset.accept(this, context) and then
context.relBuilder.build(). However, if the visitor for a dataset internally pushes
multiple nodes onto the builder stack (e.g., for complex subsearches), build() only
pops one node. Additionally, if an exception occurs mid-iteration, the builder stack
may be left in an inconsistent state for subsequent datasets. Consider wrapping each
dataset processing in a try-catch or verifying the stack depth before and after each
dataset to ensure correctness.

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java [2453-2457]

 for (UnresolvedPlan dataset : node.getDatasets()) {
   UnresolvedPlan prunedDataset = dataset.accept(new EmptySourcePropagateVisitor(), null);
+  int stackSizeBefore = context.relBuilder.size();
   prunedDataset.accept(this, context);
+  if (context.relBuilder.size() != stackSizeBefore + 1) {
+    throw new IllegalStateException(
+        "Expected exactly one RelNode pushed for union dataset, but stack grew by "
+            + (context.relBuilder.size() - stackSizeBefore));
+  }
   inputNodes.add(context.relBuilder.build());
 }
Suggestion importance[1-10]: 4


Why: The suggestion to validate the relBuilder stack depth is a reasonable defensive check, but context.relBuilder.size() may not be a public API in all Calcite versions. The concern about stack inconsistency is valid but the improved code may not compile depending on the Calcite API available.

Low
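The stack-depth guard is independent of Calcite, so it can be sketched against a plain `Deque` standing in for `RelBuilder` (an assumption; the real builder's stack-size accessor may differ across Calcite versions, as the review notes):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the guard: each dataset visit must push exactly one node
// onto the builder stack before it is popped with build()/pop().
class StackGuardSketch {
  static String buildOne(Deque<String> stack, Runnable visit) {
    int before = stack.size();
    visit.run();
    if (stack.size() != before + 1) {
      throw new IllegalStateException(
          "Expected exactly one node pushed, but stack changed by "
              + (stack.size() - before));
    }
    return stack.pop();
  }

  public static void main(String[] args) {
    Deque<String> stack = new ArrayDeque<>();
    System.out.println(buildOne(stack, () -> stack.push("scan(logs)")));
  }
}
```

A visit that pushes two nodes (for example a nested union that fails to combine its inputs) now fails fast instead of silently leaving a stale node for the next dataset to pick up.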
Ensure unified schema uses consistent coerced types

The unified schema is built from the first occurrence of each field's type, but coerceUnionTypes has already computed common supertypes. If buildUnifiedSchemaForUnion runs after coerceUnionTypes, the coerced types should be reflected in the schema. A field that appears only in a later node takes that node's type, which is correct for missing fields; for shared fields, however, the schema type should come from the coerced inputs, and the first-seen type may not match the coerced target type. Verify that the coerced inputs always agree on type before building
the schema.

core/src/main/java/org/opensearch/sql/calcite/SchemaUnifier.java [192-205]

 private static List<SchemaField> buildUnifiedSchemaForUnion(List<RelNode> nodes) {
   List<SchemaField> schema = new ArrayList<>();
-  Map<String, RelDataType> seenFields = new HashMap<>();
+  Map<String, RelDataType> seenFields = new LinkedHashMap<>();
 
   for (RelNode node : nodes) {
     for (RelDataTypeField field : node.getRowType().getFieldList()) {
-      if (!seenFields.containsKey(field.getName())) {
-        schema.add(new SchemaField(field.getName(), field.getType()));
-        seenFields.put(field.getName(), field.getType());
+      String name = field.getName();
+      if (!seenFields.containsKey(name)) {
+        seenFields.put(name, field.getType());
+      } else {
+        // After coercion, shared fields should have the same type; assert consistency
+        RelDataType existing = seenFields.get(name);
+        if (!existing.getSqlTypeName().equals(field.getType().getSqlTypeName())) {
+          // Prefer the wider/coerced type (should not happen after coerceUnionTypes)
+          seenFields.put(name, field.getType());
+        }
       }
     }
   }
+  seenFields.forEach((name, type) -> schema.add(new SchemaField(name, type)));
   return schema;
 }
Suggestion importance[1-10]: 3


Why: After coerceUnionTypes, all shared fields should already have the same type, so the scenario described is unlikely in practice. The suggestion adds complexity without addressing a clear bug, and the improved_code changes the behavior in a way that may not be necessary given the existing coercion pipeline.

Low
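The first-occurrence unification itself is simple to model over plain name-to-type maps. Strings stand in for Calcite `RelDataType` values here (an assumption; the real code walks `RelDataTypeField` lists), and a `LinkedHashMap` preserves first-seen field order:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of first-occurrence schema unification across union inputs.
class SchemaUnionSketch {
  static Map<String, String> unify(List<Map<String, String>> inputs) {
    Map<String, String> schema = new LinkedHashMap<>(); // keeps field order
    for (Map<String, String> input : inputs) {
      // putIfAbsent keeps the first-seen type, mirroring the PR's behavior;
      // after coercion, all inputs sharing a name should agree anyway.
      input.forEach(schema::putIfAbsent);
    }
    return schema;
  }

  public static void main(String[] args) {
    Map<String, String> a = new LinkedHashMap<>();
    a.put("id", "INTEGER");
    a.put("ts", "TIMESTAMP");
    Map<String, String> b = new LinkedHashMap<>();
    b.put("id", "INTEGER");
    b.put("msg", "VARCHAR");
    System.out.println(unify(List.of(a, b)));
    // {id=INTEGER, ts=TIMESTAMP, msg=VARCHAR}
  }
}
```

Fields unique to later inputs (like `msg` above) are appended after the earlier schema, which is why missing fields can then be NULL-filled per input during projection.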

@github-actions
Contributor

Persistent review updated to latest commit 5fd5e1c

@srikanthpadakanti srikanthpadakanti force-pushed the feature/union-public branch 2 times, most recently from dcbc60f to 12b2eb5 on March 17, 2026 at 16:59
@github-actions
Contributor

Persistent review updated to latest commit dcbc60f

@github-actions
Contributor

Persistent review updated to latest commit 12b2eb5

@github-actions
Contributor

Persistent review updated to latest commit 0ff52aa

@srikanthpadakanti
Contributor Author

Hello @anasalkouz @mengweieric Please enforce the PR label and review this. Thanks

Also, the failing sql-cli integration test check uses a different Gradle version (8.14.2 vs. 9.2.0); I believe the failure is unrelated to my changes.

@anasalkouz anasalkouz added PPL Piped processing language feature labels Mar 17, 2026
@github-actions
Contributor

Persistent review updated to latest commit 60613ec

@mengweieric
Collaborator

Please take a look at the CI failure.

@github-actions
Contributor

Persistent review updated to latest commit f05fbdb

@srikanthpadakanti
Contributor Author

Please take a look at the CI failure.

Took care of it. Please review. Thanks.

@github-actions
Contributor

Persistent review updated to latest commit 3782336

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>

# Conflicts:
#	integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalciteExplainIT.java
#	integ-test/src/test/java/org/opensearch/sql/security/CalciteCrossClusterSearchIT.java
@github-actions
Contributor

Persistent review updated to latest commit 9ec4314

@srikanthpadakanti
Contributor Author

Hello @mengweieric Can you please review this.


Successfully merging this pull request may close these issues: [FEATURE] Add Union command in PPL.