
Conversation

@jtuglu1 (Contributor) commented Jan 22, 2026

Description

  1. Doing some more digging, I found another unfortunate data difference between native batch (on-cluster) and Hadoop batch ingestion. A multi-value string ["a","b",null] ingested with Hadoop is treated as ["a","b","null"], whereas native batch correctly ingests it as ["a","b",null]. This difference appears to be a bug in all Druid versions (including the latest). While it will not affect the current null handling migration, it will affect the future Hadoop -> native batch ingestion migration that will also need to take place; see the sketch after this list.
  2. Hadoop doesn't allow for all-null columns in segments; it simply excludes them from the segment. I've updated the Hadoop job to support running with druid.indexer.task.storeEmptyColumns=true, which allows us to store all-NULL columns (matching how native and streaming ingestion work today).
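
A minimal standalone sketch (plain Java, not Druid code) of the coercion difference from item 1: String::valueOf turns a real null element into the literal string "null", while a null-aware mapping preserves it.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullCoercionDemo
{
  public static void main(String[] args)
  {
    List<String> row = Arrays.asList("a", "b", null);

    // Old Hadoop-path behavior: String.valueOf(null) yields the four-character string "null".
    List<String> coerced = row.stream()
                              .map(String::valueOf)
                              .collect(Collectors.toList());
    System.out.println(coerced.get(2) == null);   // false: the element is the string "null"

    // Native batch behavior: a null element stays a real null.
    List<String> preserved = row.stream()
                                .map(v -> v == null ? null : String.valueOf(v))
                                .collect(Collectors.toList());
    System.out.println(preserved.get(2) == null); // true
  }
}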

Related to:

Release note

Fix Hadoop null value handling to match native batch and allow v10 segment creation.

BREAKING CHANGES
  1. Hadoop ingestion will now process multi-value string inputs like ["a","b",null] -> ["a","b",null] instead of ["a","b","null"], matching native batch ingestion.
  2. Hadoop ingestion will now keep columns with all NULL values by default, instead of excluding them from the segment.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from d271c47 to b1313f3 on January 22, 2026 23:23
@jtuglu1 jtuglu1 requested a review from clintropolis January 22, 2026 23:23
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from 807fe75 to 788fd25 on January 23, 2026 00:54
@jtuglu1 jtuglu1 requested a review from maytasm January 23, 2026 01:27
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 788fd25 to 12073e9 on January 23, 2026 01:51
@gianm (Contributor) left a comment:


Seems like a good change to me. This behavior was reverted in #15190, but the additional changes in this patch look like they would deal with the original problem. Have you run this through a real-world test to confirm that everything works as expected?

@jtuglu1 (Contributor, Author) commented Jan 23, 2026

@gianm another thing I discovered investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default). Native batch in the latest version does. This is the key difference that showed up in the segment diff (see #12279). Do you know why this is?

For example, if you were to ingest a [null] multi-value string, native batch would correctly see this as null, whereas Hadoop would not even include the column in the segment, which causes headaches when switching from Hadoop to native batch.

@gianm (Contributor) commented Jan 23, 2026

> @gianm another thing I discovered investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default). Native batch in the latest version does. This is the key difference that showed up in the segment diff (see #12279). Do you know why this is?

There's a comment in IndexMergerV9 that says Hadoop indexing uses a constructor that doesn't support the storeEmptyColumns configuration yet. In that spot it's hard-coded to false. I suppose it would make more sense for that to be hard-coded to TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS, i.e., true. It would make even more sense for it to respect the actual configuration, which would mean switching to some logic like the TaskToolboxFactory uses:

config.buildV10()
            ? indexMergerV10Factory.create()
            : indexMergerV9Factory.create(
                task.getContextValue(Tasks.STORE_EMPTY_COLUMNS_KEY, config.isStoreEmptyColumns())
            )

Rather than injecting the IndexMergerV9 directly.
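
For concreteness, a hypothetical sketch of what respecting the runtime property on the Hadoop side could look like (props, mapper, indexIO, and writeOutMediumFactory are assumed to be available from the injector; this is not the actual patch):

// Hypothetical: read the property discussed above, defaulting to
// TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS (i.e., true).
boolean storeEmptyColumns = Boolean.parseBoolean(
    props.getProperty(
        "druid.indexer.task.storeEmptyColumns",
        String.valueOf(TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS)
    )
);
// Pass it to the constructor that supports storeEmptyColumns, rather than the
// one that hard-codes it to false.
IndexMergerV9 merger = new IndexMergerV9(mapper, indexIO, writeOutMediumFactory, storeEmptyColumns);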

@jtuglu1 (Contributor, Author) commented Jan 23, 2026

@gianm thanks, that's what I thought. I guess I just wanted to make sure there wasn't any critical piece of null merging that was missing/incompatible with Hadoop for both V9/V10.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from e29849f to 9b5c25d on January 23, 2026 18:08
@jtuglu1 jtuglu1 requested a review from gianm January 23, 2026 18:15
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 9b5c25d to cd6b5eb on January 23, 2026 18:32
Comment on lines +318 to +339
DataSchema.builder()
.withDataSource("test_null_values")
.withParserMap(MAPPER.convertValue(
new StringInputRowParser(
new JSONParseSpec(
new TimestampSpec("ts", "iso", null),
new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensions)),
null,
null,
null
),
null
),
Map.class
))
.withAggregators(new CountAggregatorFactory("count"))
.withGranularity(new UniformGranularitySpec(
Granularities.DAY,
Granularities.NONE,
ImmutableList.of(INTERVAL)
))
.withObjectMapper(MAPPER)

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation (test). Invoking Builder.withObjectMapper should be avoided because it has been deprecated.
Comment on lines +318 to +332
DataSchema.builder()
.withDataSource("test_null_values")
.withParserMap(MAPPER.convertValue(
new StringInputRowParser(
new JSONParseSpec(
new TimestampSpec("ts", "iso", null),
new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensions)),
null,
null,
null
),
null
),
Map.class
))

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation (test). Invoking Builder.withParserMap should be avoided because it has been deprecated.
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from cd6b5eb to 7f66e89 on January 23, 2026 21:07
// For string array, nulls are preserved so use ArrayList (ImmutableList doesn't support nulls)
Assert.assertEquals(
-    Lists.transform(SOME_STRING_ARRAY_VALUE, String::valueOf),
+    SOME_STRING_ARRAY_VALUE.stream().map(v -> v == null ? null : String.valueOf(v)).collect(Collectors.toList()),
Member:
nit: this can use Evals.asString
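
Since Evals.asString preserves nulls (it is the same replacement applied to Rows.objectToStrings later in this thread), the expected-value side of the assertion could collapse to a method reference:

List<String> expected = SOME_STRING_ARRAY_VALUE.stream()
                                               .map(Evals::asString)
                                               .collect(Collectors.toList());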

DATA_SEGMENT_PUSHER = INJECTOR.getInstance(DataSegmentPusher.class);
PROPERTIES = INJECTOR.getInstance(Properties.class);

boolean buildV10 = Boolean.parseBoolean(PROPERTIES.getProperty(BUILD_V10_KEY, "false"));
Member:
neat, 👍 I ignored hadoop for v10 format stuff, thanks for adding

{
  byte[] result = readBytes(in);
-  return StringUtils.fromUtf8(result);
+  return result == null ? null : StringUtils.fromUtf8(result);
Member:

i wonder if we should make a StringUtils.fromUtf8Nullable that takes byte[] (we already have one for ByteBuffer)
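
A sketch of what that helper might look like, mirroring the existing ByteBuffer overload in StringUtils (placement and annotations assumed):

// Hypothetical addition to org.apache.druid.java.util.common.StringUtils.
@Nullable
public static String fromUtf8Nullable(@Nullable final byte[] bytes)
{
  return bytes == null ? null : fromUtf8(bytes);
}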

Contributor Author:

I can add that

} else if (inputValue instanceof List) {
  // guava's toString function fails on null objects, so please do not use it
-  return ((List<?>) inputValue).stream().map(String::valueOf).collect(Collectors.toList());
+  return ((List<?>) inputValue).stream().map(Evals::asString).collect(Collectors.toList());
Member:

i think this is a good change because the old behavior was wack, but I'm still tracing through to try to determine the actual impacts of this change.

Besides the hadoop impact, which you fix in this PR, this method seems like it will mostly impact callers of Row.getDimension as well as the toGroupKey method of this class since it calls getDimension.

luckily there are a relatively small number of 'production' callers of these methods

Row.getDimension:
[screenshot of Row.getDimension call sites]

Rows.toGroupKey:
[screenshot of Rows.toGroupKey call sites]

which look mostly related to partitioning. I think we need to determine if the null -> 'null' coercion is important for these callers, and if so, do the coercion there. I'm uncertain currently but will keep trying to figure it out.
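
If the null -> 'null' coercion does turn out to matter for partitioning, one hypothetical option (not part of this patch) would be to coerce at the group-key call sites instead of inside Rows.objectToStrings:

// Hypothetical: reinstate null -> "null" only where partitioning needs it;
// Row.getDimension returns a List<String> that may contain real nulls after this change.
List<String> partitionValues = row.getDimension(dimension)
                                  .stream()
                                  .map(v -> v == null ? "null" : v)
                                  .collect(Collectors.toList());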

Contributor Author:

@clintropolis I can scope this PR to Hadoop-only by creating a separate implementation of the Rows.* methods, if you'd prefer.

Member:

It feels worth figuring this out, since the old code seems quite odd to be doing what it is at this layer, so I want to keep looking. It probably would be fine though if we can't figure it out?

Contributor Author:

I've summarized the breaking changes in the release notes section.

@jtuglu1 (Contributor, Author) commented Jan 23, 2026

@clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

@gianm (Contributor) commented Jan 24, 2026

> @clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

I do not know. It's the DimensionsSpec parameter, but I don't recall what happens when that isn't provided.

