
Conversation

@jtuglu1 (Contributor) commented Jan 22, 2026

Description

  1. Doing some more digging, I found another unfortunate data difference between native batch (on-cluster) and Hadoop batch ingestion. A multi-value string ["a","b",null] ingested with Hadoop is treated as ["a","b","null"], whereas native batch correctly ingests it as ["a","b",null]. This difference appears to be a bug in all Druid versions (including the latest). While it will not affect the current null handling migration, it will affect the future Hadoop -> native batch ingestion migration that will also need to take place; see the sketch after this list.
  2. Hadoop doesn't allow for all-null columns in segments; it simply excludes them from the segment. I've updated the Hadoop job to support running with druid.indexer.task.storeEmptyColumns=true, which allows us to store all-NULL columns (matching how native and streaming ingestion work today).
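
A minimal standalone sketch (plain Java, not Druid code) of the coercion difference from item 1: String::valueOf turns a real null element into the literal string "null", while a null-aware mapping preserves it.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NullCoercionDemo
{
  public static void main(String[] args)
  {
    List<String> row = Arrays.asList("a", "b", null);

    // Old Hadoop-path behavior: String.valueOf(null) yields the four-character string "null".
    List<String> coerced = row.stream()
                              .map(String::valueOf)
                              .collect(Collectors.toList());
    System.out.println(coerced.get(2) == null);   // false: the element is the string "null"

    // Native batch behavior: a null element stays a real null.
    List<String> preserved = row.stream()
                                .map(v -> v == null ? null : String.valueOf(v))
                                .collect(Collectors.toList());
    System.out.println(preserved.get(2) == null); // true
  }
}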

Related to:

Release note

Fix Hadoop null value handling to match native batch and allow v10 segment creation.

BREAKING CHANGES
  1. Hadoop ingestion will now process multi-value string inputs like ["a","b",null] -> ["a","b",null] instead of ["a","b","null"], matching native batch ingestion.
  2. Hadoop ingestion will now keep columns with all NULL values by default, instead of excluding them from the segment.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from d271c47 to b1313f3 on January 22, 2026 23:23
@jtuglu1 jtuglu1 requested a review from clintropolis January 22, 2026 23:23
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from 807fe75 to 788fd25 on January 23, 2026 00:54
@jtuglu1 jtuglu1 requested a review from maytasm January 23, 2026 01:27
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 788fd25 to 12073e9 on January 23, 2026 01:51
@gianm (Contributor) left a comment:


Seems like a good change to me. This behavior was reverted in #15190, but the additional changes in this patch look like they would deal with the original problem. Have you run this through a real-world test to confirm that everything works as expected?

@jtuglu1 (Contributor, Author) commented Jan 23, 2026

@gianm another thing I discovered investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default). Native batch in the latest version does. This is the key difference that showed up in the segment diff (see #12279). Do you know why this is?

For example, if you were to ingest a [null] multi-value string, native batch would correctly see this as null, whereas Hadoop would not even include the column in the segment, which causes headaches when switching from Hadoop to native batch.

@gianm (Contributor) commented Jan 23, 2026

> @gianm another thing I discovered investigating this patch is that Hadoop by default does not create all-null columns in a segment (-Ddruid.indexer.task.storeEmptyColumns=false by default). Native batch in the latest version does. This is the key difference that showed up in the segment diff (see #12279). Do you know why this is?

There's a comment in IndexMergerV9 that says Hadoop indexing uses a constructor that doesn't support the storeEmptyColumns configuration yet. In that spot it's hard-coded to false. I suppose it would make more sense for that to be hard-coded to TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS, i.e., true. It would make even more sense for it to respect the actual configuration, which would mean switching to some logic like the TaskToolboxFactory uses:

config.buildV10()
            ? indexMergerV10Factory.create()
            : indexMergerV9Factory.create(
                task.getContextValue(Tasks.STORE_EMPTY_COLUMNS_KEY, config.isStoreEmptyColumns())
            )

Rather than injecting the IndexMergerV9 directly.
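
For concreteness, a hypothetical sketch of what respecting the runtime property on the Hadoop side could look like (props, mapper, indexIO, and writeOutMediumFactory are assumed to be available from the injector; this is not the actual patch):

// Hypothetical: read the property discussed above, defaulting to
// TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS (i.e., true).
boolean storeEmptyColumns = Boolean.parseBoolean(
    props.getProperty(
        "druid.indexer.task.storeEmptyColumns",
        String.valueOf(TaskConfig.DEFAULT_STORE_EMPTY_COLUMNS)
    )
);
// Pass it to the constructor that supports storeEmptyColumns, rather than the
// one that hard-codes it to false.
IndexMergerV9 merger = new IndexMergerV9(mapper, indexIO, writeOutMediumFactory, storeEmptyColumns);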

@jtuglu1 (Contributor, Author) commented Jan 23, 2026

@gianm thanks, that's what I thought. I guess I just wanted to make sure there wasn't any critical piece of null merging that was missing/incompatible with Hadoop for both V9/V10.

@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch 2 times, most recently from e29849f to 9b5c25d on January 23, 2026 18:08
@jtuglu1 jtuglu1 requested a review from gianm January 23, 2026 18:15
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from 9b5c25d to cd6b5eb on January 23, 2026 18:32
Comment on lines +318 to +339
DataSchema.builder()
.withDataSource("test_null_values")
.withParserMap(MAPPER.convertValue(
new StringInputRowParser(
new JSONParseSpec(
new TimestampSpec("ts", "iso", null),
new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensions)),
null,
null,
null
),
null
),
Map.class
))
.withAggregators(new CountAggregatorFactory("count"))
.withGranularity(new UniformGranularitySpec(
Granularities.DAY,
Granularities.NONE,
ImmutableList.of(INTERVAL)
))
.withObjectMapper(MAPPER)

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation (test). Invoking Builder.withObjectMapper should be avoided because it has been deprecated.
Comment on lines +318 to +332
DataSchema.builder()
.withDataSource("test_null_values")
.withParserMap(MAPPER.convertValue(
new StringInputRowParser(
new JSONParseSpec(
new TimestampSpec("ts", "iso", null),
new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensions)),
null,
null,
null
),
null
),
Map.class
))

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation (test). Invoking Builder.withParserMap should be avoided because it has been deprecated.
@jtuglu1 jtuglu1 force-pushed the fix-hadoop-and-native-batch-ingest-mvs-null-handling branch from cd6b5eb to 7f66e89 on January 23, 2026 21:07
// For string array, nulls are preserved so use ArrayList (ImmutableList doesn't support nulls)
Assert.assertEquals(
-    Lists.transform(SOME_STRING_ARRAY_VALUE, String::valueOf),
+    SOME_STRING_ARRAY_VALUE.stream().map(v -> v == null ? null : String.valueOf(v)).collect(Collectors.toList()),
Member:
nit: this can use Evals.asString
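
Since Evals.asString preserves nulls (it is the same replacement applied to Rows.objectToStrings later in this thread), the expected-value side of the assertion could collapse to a method reference:

List<String> expected = SOME_STRING_ARRAY_VALUE.stream()
                                               .map(Evals::asString)
                                               .collect(Collectors.toList());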

DATA_SEGMENT_PUSHER = INJECTOR.getInstance(DataSegmentPusher.class);
PROPERTIES = INJECTOR.getInstance(Properties.class);

boolean buildV10 = Boolean.parseBoolean(PROPERTIES.getProperty(BUILD_V10_KEY, "false"));
Member:
neat, 👍 I ignored hadoop for v10 format stuff, thanks for adding

{
  byte[] result = readBytes(in);
-  return StringUtils.fromUtf8(result);
+  return result == null ? null : StringUtils.fromUtf8(result);
Member:

i wonder if we should make a StringUtils.fromUtf8Nullable that takes byte[] (we already have one for ByteBuffer)
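
A sketch of what that helper might look like, mirroring the existing ByteBuffer overload in StringUtils (placement and annotations assumed):

// Hypothetical addition to org.apache.druid.java.util.common.StringUtils.
@Nullable
public static String fromUtf8Nullable(@Nullable final byte[] bytes)
{
  return bytes == null ? null : fromUtf8(bytes);
}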

Contributor Author:

I can add that

} else if (inputValue instanceof List) {
  // guava's toString function fails on null objects, so please do not use it
-  return ((List<?>) inputValue).stream().map(String::valueOf).collect(Collectors.toList());
+  return ((List<?>) inputValue).stream().map(Evals::asString).collect(Collectors.toList());
Member:

i think this is a good change because the old behavior was wack, but I'm still tracing through to try to determine the actual impacts of this change.

Besides the hadoop impact, which you fix in this PR, this method seems like it will mostly impact callers of Row.getDimension as well as the toGroupKey method of this class since it calls getDimension.

luckily there are a relatively small number of 'production' callers of these methods

Row.getDimension:
[screenshot of Row.getDimension call sites]

Rows.toGroupKey:
[screenshot of Rows.toGroupKey call sites]

which look mostly related to partitioning. I think we need to determine if the null -> 'null' coercion is important for these callers, and if so, do the coercion there. I'm uncertain currently but will keep trying to figure it out.
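
If the null -> 'null' coercion does turn out to matter for partitioning, one hypothetical option (not part of this patch) would be to coerce at the group-key call sites instead of inside Rows.objectToStrings:

// Hypothetical: reinstate null -> "null" only where partitioning needs it;
// Row.getDimension returns a List<String> that may contain real nulls after this change.
List<String> partitionValues = row.getDimension(dimension)
                                  .stream()
                                  .map(v -> v == null ? "null" : v)
                                  .collect(Collectors.toList());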

Contributor Author:

@clintropolis I can scope this PR to Hadoop-only by creating a separate implementation of the Rows.* methods, if you'd prefer.

Member:

It feels worth figuring this out, since the old code seems quite odd to be doing what it is at this layer, so I want to keep looking. It probably would be fine though if we can't figure it out?

Contributor Author:

I've summarized the breaking changes in the release notes section.

@jtuglu1 (Contributor, Author) commented Jan 23, 2026

@clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

@gianm (Contributor) commented Jan 24, 2026

> @clintropolis @gianm Question: do you know why this is null? I had to update this to get it to work properly (actually ingest null-only columns).

I do not know. It's the DimensionsSpec parameter, but I don't recall what happens when that isn't provided.

