Fix Hadoop multi-value string null value handling to match native batch #18944
Conversation
gianm
left a comment
Seems like a good change to me. This behavior was reverted in #15190 but the additional changes in this patch look like they would deal with the original problem. Have you run this through a real world test to confirm that everything works as expected?
Review threads (outdated, resolved):
- indexing-hadoop/src/main/java/org/apache/druid/indexer/InputRowSerde.java
- indexing-hadoop/src/test/java/org/apache/druid/indexer/InputRowSerdeTest.java
@gianm another thing I discovered investigating this patch is that Hadoop by default does not create all-null columns in a segment. For example, if you were to ingest …
There's a comment in:

```java
config.buildV10()
    ? indexMergerV10Factory.create()
    : indexMergerV9Factory.create(
          task.getContextValue(Tasks.STORE_EMPTY_COLUMNS_KEY, config.isStoreEmptyColumns())
      )
```

Rather than injecting the …
@gianm thanks, that's what I thought. I guess I just wanted to make sure there wasn't any critical piece of null merging that was missing/incompatible with Hadoop for both V9/V10.
```java
DataSchema.builder()
          .withDataSource("test_null_values")
          .withParserMap(MAPPER.convertValue(
              new StringInputRowParser(
                  new JSONParseSpec(
                      new TimestampSpec("ts", "iso", null),
                      new DimensionsSpec(DimensionsSpec.getDefaultSchemas(dimensions)),
                      null,
                      null,
                      null
                  ),
                  null
              ),
              Map.class
          ))
          .withAggregators(new CountAggregatorFactory("count"))
          .withGranularity(new UniformGranularitySpec(
              Granularities.DAY,
              Granularities.NONE,
              ImmutableList.of(INTERVAL)
          ))
          .withObjectMapper(MAPPER)
```

Check notice (Code scanning / CodeQL): deprecated method or constructor invocation — `Builder.withObjectMapper`
Check notice (Code scanning / CodeQL): deprecated method or constructor invocation — `Builder.withParserMap`
```diff
 // For string array, nulls are preserved so use ArrayList (ImmutableList doesn't support nulls)
 Assert.assertEquals(
-    Lists.transform(SOME_STRING_ARRAY_VALUE, String::valueOf),
+    SOME_STRING_ARRAY_VALUE.stream().map(v -> v == null ? null : String.valueOf(v)).collect(Collectors.toList()),
```
nit: this can use Evals.asString
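For context, a minimal sketch of the null-safe conversion being suggested. The name `Evals.asString` comes from the review comment above; the body here is my assumed equivalent, shown against the `String::valueOf` behavior it replaces:

```java
public class EvalsAsStringSketch
{
  // Assumed equivalent of Druid's Evals.asString: null-safe String conversion.
  public static String asString(Object o)
  {
    return o == null ? null : String.valueOf(o);
  }

  public static void main(String[] args)
  {
    // String.valueOf coerces a null Object to the literal 4-character string "null"...
    System.out.println(String.valueOf((Object) null)); // prints "null"
    // ...while the null-safe variant preserves the null.
    System.out.println(asString(null) == null); // prints "true"
  }
}
```

This is exactly the distinction at stake in this PR: a real null versus the string `"null"`.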
```java
DATA_SEGMENT_PUSHER = INJECTOR.getInstance(DataSegmentPusher.class);
PROPERTIES = INJECTOR.getInstance(Properties.class);

boolean buildV10 = Boolean.parseBoolean(PROPERTIES.getProperty(BUILD_V10_KEY, "false"));
```
neat, 👍 I ignored hadoop for v10 format stuff, thanks for adding
```diff
 {
   byte[] result = readBytes(in);
-  return StringUtils.fromUtf8(result);
+  return result == null ? null : StringUtils.fromUtf8(result);
```
i wonder if we should make a StringUtils.fromUtf8Nullable that takes byte[] (we already have one for ByteBuffer)
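A sketch of what such a helper might look like. The name `fromUtf8Nullable` for a `byte[]` overload is the reviewer's suggestion, not an existing API; the decoding itself is plain JDK UTF-8:

```java
import java.nio.charset.StandardCharsets;

public class FromUtf8NullableSketch
{
  // Hypothetical byte[] overload of fromUtf8Nullable, as suggested above:
  // returns null for null input instead of throwing.
  public static String fromUtf8Nullable(byte[] bytes)
  {
    return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args)
  {
    System.out.println(fromUtf8Nullable(null) == null); // prints "true"
    System.out.println(fromUtf8Nullable("abc".getBytes(StandardCharsets.UTF_8))); // prints "abc"
  }
}
```

With such a helper, the null check in the `readBytes` caller above would collapse into a single call.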
I can add that
```diff
 } else if (inputValue instanceof List) {
   // guava's toString function fails on null objects, so please do not use it
-  return ((List<?>) inputValue).stream().map(String::valueOf).collect(Collectors.toList());
+  return ((List<?>) inputValue).stream().map(Evals::asString).collect(Collectors.toList());
```
i think this is a good change because the old behavior was wack, but I'm still tracing through to try to determine the actual impacts of this change.
Besides the hadoop impact, which you fix in this PR, this method seems like it will mostly impact callers of Row.getDimension, as well as the toGroupKey method of this class since it calls getDimension.
luckily there are a relatively small number of 'production' callers of these methods, which look mostly related to partitioning. I think we need to determine if the null -> 'null' coercion is important for these callers, and if so, do the coercion there. I'm uncertain currently but will keep trying to figure it out.
@clintropolis I can scope this PR to hadoop-only by creating a separate implementation of the Rows.* methods?
it feels worth figuring this out since the old code seems quite odd to be doing what it is at this layer, so I want to keep looking. It probably would be fine though if we can't figure it out?
I've summarized the breaking changes in the release notes section.
@clintropolis @gianm Question: do you know why this is …
I do not know. It's the DimensionsSpec parameter, but I don't recall what happens when that isn't provided.


Description
Currently, ingesting `["a","b",null]` with Hadoop is treated as `["a","b","null"]`, while native batch correctly ingests this as `["a","b",null]`. This difference appears to be a bug in all Druid versions (even the latest). While this will not affect the current null handling migration, it will affect the future Hadoop -> native batch ingestion migration that will also need to take place.

This PR also adds support for `druid.indexer.task.storeEmptyColumns=true`, which allows us to store all-`NULL` columns (how native/streaming ingestion works today).

Related to:
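The behavior difference described above can be illustrated with a minimal, self-contained sketch (plain Java, not Druid code): the old Hadoop path coerces null list elements to the string `"null"`, while the native-batch behavior keeps them as real nulls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MultiValueNullDemo
{
  public static void main(String[] args)
  {
    // Mimics a multi-value string dimension value of ["a","b",null].
    List<Object> input = Arrays.asList("a", "b", null);

    // Old Hadoop-style coercion: String.valueOf turns null into the string "null".
    List<String> coerced = new ArrayList<>();
    for (Object v : input) {
      coerced.add(String.valueOf(v));
    }
    System.out.println(coerced.get(2) == null); // prints "false" -- it's the 4-char string "null"

    // Native-batch-style handling: nulls are preserved as real nulls.
    List<String> preserved = new ArrayList<>();
    for (Object v : input) {
      preserved.add(v == null ? null : String.valueOf(v));
    }
    System.out.println(preserved.get(2) == null); // prints "true"
  }
}
```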
Release note
Fix Hadoop null value handling to match native batch and allow v10 segment creation.
BREAKING CHANGES
- Multi-value strings ingested through Hadoop now preserve nulls: `["a","b",null]` -> `["a","b",null]` instead of `["a","b","null"]`, to match native batch ingestion.
- Segments now store `NULL` values, instead of excluding them from the segment.

This PR has: