Skip to content

feat: metrics for parquet writer#651

Merged
wgtmac merged 3 commits into
apache:mainfrom
WZhuo:parquet_metrics
May 26, 2026
Merged

feat: metrics for parquet writer#651
wgtmac merged 3 commits into
apache:mainfrom
WZhuo:parquet_metrics

Conversation

@WZhuo
Copy link
Copy Markdown
Contributor

@WZhuo WZhuo commented May 11, 2026

No description provided.

@WZhuo WZhuo force-pushed the parquet_metrics branch from 63f7e38 to 4ec4aae Compare May 21, 2026 04:11
@WZhuo WZhuo force-pushed the parquet_metrics branch from 4ec4aae to 01f825d Compare May 21, 2026 09:05
Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of metrics edge cases to fix before this lands.

Comment thread src/iceberg/parquet/parquet_metrics.cc
Comment thread src/iceberg/parquet/parquet_writer.cc
@WZhuo WZhuo force-pushed the parquet_metrics branch from 654a0da to 6d25fd4 Compare May 25, 2026 03:07
@WZhuo WZhuo force-pushed the parquet_metrics branch from 6d25fd4 to a8c650d Compare May 25, 2026 03:15
Comment thread src/iceberg/metrics.h
/// lower/upper bounds for a specific field identified by its field_id.
struct ICEBERG_EXPORT FieldMetrics {
/// \brief The field ID this metrics belongs to.
int32_t field_id;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to initialize it by an invalid value just in case undefined behavior?

/// If provided, these take precedence over footer statistics.
/// \return Result containing the computed Metrics or an error.
static Result<Metrics> GetMetrics(
const Schema& schema, const ::parquet::SchemaDescriptor& parquet_schema,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, we cannot expose symbols like ::parquet:: which belongs to thirdparty dependencies. So we need to rename this header file to parquet_metrics_internal.h to not install it. (note that the source file does not need to rename)

auto min_span = std::span<const uint8_t>(
reinterpret_cast<const uint8_t*>(min_bytes.data()), min_bytes.size());
ICEBERG_ASSIGN_OR_RAISE(auto min_value,
Conversions::FromBytes(iceberg_type, min_span));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stats->EncodeMin() is encoded by Parquet plain encoding but Conversions::FromBytes expects Iceberg single value serialization. This conversion may work coincidently for some primitive types. We need to fix it by creating a mapping between Parquet and Iceberg types and then use explicit cast to obtain typed Parquet values to create correct literals.

}

if (!lower_bound.has_value() || !upper_bound.has_value() || lower_bound->IsNaN() ||
upper_bound->IsNaN()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the above conversion has been fixed, we can move NaN check only for floating types.

}

ICEBERG_ASSIGN_OR_RAISE(auto truncated_lower,
TruncateUtils::TruncateLowerBound(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, truncation can only be done on string types.

Comment on lines +179 to +182
if (!truncated.has_value()) {
return std::nullopt;
}
return std::optional<Literal>(Literal::String(std::move(truncated.value())));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can directly write return truncated.transform([](std::string t) { return Literal::String(std::move(t)); }); to make it compact.

const auto& data = std::get<std::vector<uint8_t>>(literal.value());
if (static_cast<int32_t>(data.size()) <= width) {
return literal;
return std::optional<Literal>(literal);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need an extra wrapper of std::optional here, isn't it? Same for other changes in this file.

}
}
return InvalidArgument("Cannot truncate upper bound for binary: all bytes are 0xFF");
return std::nullopt;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this error message is worth keeping.

.null_value_count = fm.null_value_count,
.nan_value_count = fm.nan_value_count};

if (truncate_length > 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we check primitive_type first to skip types that do not need truncation?


namespace iceberg::test {

class ParquetMetricsTest : public MetricsTestBase, public ::testing::Test {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test suite is broad, but several cases are too forgiving and would miss real bugs.

  • Decimal bounds are under-tested. MetricsForDecimals covers decimal-as-INT32, INT64, and fixed, but only checks that bounds exist. It should assert exact lower/upper values for each physical layout. This is exactly how the EncodeMin() vs Iceberg Conversions::FromBytes bug can slip through.
  • “Multiple row group” tests may not actually test multiple row groups. MetricsForTopLevelWithMultipleRowGroup and nested equivalent write 201 rows, but Parquet’s SupportsSmallRowGroups() returns false, so row group count is not verified. Force multiple row groups or remove the claim.
  • field_metrics override path is not tested. ParquetMetrics::GetMetrics supports precomputed FieldMetrics, but parquet_writer.cc always passes {}. Add direct tests where non-empty field_metrics overrides footer stats, carries NaN counts, and applies truncation.
  • Bounds assertions ignore literal type. AssertBounds receives a PrimitiveType but does not validate the returned Literal type. So date vs int, timestamp vs long, etc. could pass accidentally. Assert the literal type as well as the value.
  • Column sizes are barely verified. Several tests only check column_sizes.empty() or !empty(). They should assert exact field-id sets, especially for none, counts, nested-field overrides, and list/map exclusion behavior.
  • Truncate integration is incomplete. Utility tests cover std::nullopt for impossible upper bounds, but there is no Parquet metrics test proving that an unrepresentable truncated upper bound omits only the upper bound while preserving counts/lower bound.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spotted various issues. Since this requires understanding of parquet internals, let me merge this as is and I will continue working on this. Thanks @WZhuo for initiating this!

@wgtmac wgtmac merged commit 97ea870 into apache:main May 26, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants