feat: metrics for parquet writer#651
Conversation
wgtmac
left a comment
There was a problem hiding this comment.
A couple of metrics edge cases to fix before this lands.
| /// lower/upper bounds for a specific field identified by its field_id. | ||
| struct ICEBERG_EXPORT FieldMetrics { | ||
| /// \brief The field ID this metrics belongs to. | ||
| int32_t field_id; |
There was a problem hiding this comment.
Do we need to initialize it by an invalid value just in case undefined behavior?
| /// If provided, these take precedence over footer statistics. | ||
| /// \return Result containing the computed Metrics or an error. | ||
| static Result<Metrics> GetMetrics( | ||
| const Schema& schema, const ::parquet::SchemaDescriptor& parquet_schema, |
There was a problem hiding this comment.
In principle, we cannot expose symbols like ::parquet:: which belongs to thirdparty dependencies. So we need to rename this header file to parquet_metrics_internal.h to not install it. (note that the source file does not need to rename)
| auto min_span = std::span<const uint8_t>( | ||
| reinterpret_cast<const uint8_t*>(min_bytes.data()), min_bytes.size()); | ||
| ICEBERG_ASSIGN_OR_RAISE(auto min_value, | ||
| Conversions::FromBytes(iceberg_type, min_span)); |
There was a problem hiding this comment.
stats->EncodeMin() is encoded by Parquet plain encoding but Conversions::FromBytes expects Iceberg single value serialization. This conversion may work coincidently for some primitive types. We need to fix it by creating a mapping between Parquet and Iceberg types and then use explicit cast to obtain typed Parquet values to create correct literals.
| } | ||
|
|
||
| if (!lower_bound.has_value() || !upper_bound.has_value() || lower_bound->IsNaN() || | ||
| upper_bound->IsNaN()) { |
There was a problem hiding this comment.
If the above conversion has been fixed, we can move NaN check only for floating types.
| } | ||
|
|
||
| ICEBERG_ASSIGN_OR_RAISE(auto truncated_lower, | ||
| TruncateUtils::TruncateLowerBound( |
There was a problem hiding this comment.
Ditto, truncation can only be done on string types.
| if (!truncated.has_value()) { | ||
| return std::nullopt; | ||
| } | ||
| return std::optional<Literal>(Literal::String(std::move(truncated.value()))); |
There was a problem hiding this comment.
We can directly write return truncated.transform([](std::string t) { return Literal::String(std::move(t)); }); to make it compact.
| const auto& data = std::get<std::vector<uint8_t>>(literal.value()); | ||
| if (static_cast<int32_t>(data.size()) <= width) { | ||
| return literal; | ||
| return std::optional<Literal>(literal); |
There was a problem hiding this comment.
We don't need an extra wrapper of std::optional here, isn't it? Same for other changes in this file.
| } | ||
| } | ||
| return InvalidArgument("Cannot truncate upper bound for binary: all bytes are 0xFF"); | ||
| return std::nullopt; |
There was a problem hiding this comment.
I think this error message is worth keeping.
| .null_value_count = fm.null_value_count, | ||
| .nan_value_count = fm.nan_value_count}; | ||
|
|
||
| if (truncate_length > 0) { |
There was a problem hiding this comment.
Can we check primitive_type first to skip types that do not need truncation?
|
|
||
| namespace iceberg::test { | ||
|
|
||
| class ParquetMetricsTest : public MetricsTestBase, public ::testing::Test { |
There was a problem hiding this comment.
The test suite is broad, but several cases are too forgiving and would miss real bugs.
- Decimal bounds are under-tested. MetricsForDecimals covers decimal-as-INT32, INT64, and fixed, but only checks that bounds exist. It should assert exact lower/upper values for each physical layout. This is exactly how the
EncodeMin()vs IcebergConversions::FromBytesbug can slip through. - “Multiple row group” tests may not actually test multiple row groups. MetricsForTopLevelWithMultipleRowGroup and nested equivalent write 201 rows, but Parquet’s
SupportsSmallRowGroups()returns false, so row group count is not verified. Force multiple row groups or remove the claim. field_metricsoverride path is not tested.ParquetMetrics::GetMetricssupports precomputedFieldMetrics, but parquet_writer.cc always passes{}. Add direct tests where non-emptyfield_metricsoverrides footer stats, carries NaN counts, and applies truncation.- Bounds assertions ignore literal type. AssertBounds receives a
PrimitiveTypebut does not validate the returnedLiteraltype. So date vs int, timestamp vs long, etc. could pass accidentally. Assert the literal type as well as the value. - Column sizes are barely verified. Several tests only check
column_sizes.empty()or!empty(). They should assert exact field-id sets, especially fornone,counts, nested-field overrides, and list/map exclusion behavior. - Truncate integration is incomplete. Utility tests cover
std::nulloptfor impossible upper bounds, but there is no Parquet metrics test proving that an unrepresentable truncated upper bound omits only the upper bound while preserving counts/lower bound.
No description provided.