feat: add BigLake Iceberg support for BigQuery analytics plugin #4750

caohy1988 wants to merge 6 commits into google:main
Conversation
Add `biglake_storage_uri` config option that, when set alongside `connection_id`, automatically creates BigLake managed Iceberg tables and replaces JSON schema fields with STRING (since BigLake Iceberg does not support JSON type). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1. Normalize connection_id to full resource path for BigLakeConfiguration
(projects/{project}/locations/{loc}/connections/{name}).
2. Skip time partitioning for BigLake Iceberg by default (preview feature);
add biglake_time_partitioning opt-in flag.
3. Document Storage Write API latency caveat for Iceberg metadata refresh
(~90 min for open-source engine visibility).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…connection_id normalization _normalize_biglake_connection_id() now correctly parses "project.location.connection" (e.g. "myproj.us.my-conn") in addition to the two-part "location.connection" and full resource path forms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces support for BigLake managed Iceberg tables within the BigQuery analytics plugin. It enables users to configure their analytics to write data in Iceberg format to Google Cloud Storage, leveraging BigQuery's capabilities. The changes include schema adjustments for Iceberg compatibility, standardized connection handling, and flexible partitioning options.

Highlights
Activity
Response from ADK Triaging Agent

Hello @caohy1988, thank you for your contribution! To proceed with the review, could you please address the following points from our contribution guidelines:
Completing these steps will help us move forward with the review process. Thanks!
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review
This pull request introduces support for BigLake Iceberg tables in the BigQuery analytics plugin, which is a valuable enhancement. The changes are well-implemented, including new configuration options, schema adjustments for Iceberg compatibility, and robust connection ID normalization. The accompanying unit tests are thorough and cover the new functionality comprehensively. I have one minor suggestion to remove a redundant validation check to improve code clarity. Overall, this is a high-quality contribution.
```python
tbl.clustering_fields = self.config.clustering_fields
tbl.labels = {_SCHEMA_VERSION_LABEL_KEY: _SCHEMA_VERSION}
if self.is_biglake:
    from google.cloud.bigquery.table import BigLakeConfiguration
```
This validation check for connection_id is redundant. An equivalent check is already performed in the __init__ method (lines 1956-1959). Failing early during plugin instantiation is preferable to failing during lazy setup, as it's easier to debug. Removing this duplicate check will make the code cleaner.
I looked into the Spark BigQuery connector path as a way to validate the “BigLake Iceberg supports high throughput streaming using the Storage Write API” claim. Short conclusion: yes, the Spark connector is a valid proof path for Storage Write API -> BigLake Iceberg, but it is not a good in-process replacement for the current Python plugin writer. Why:
Example shape:

```python
(
    df.write
      .format("bigquery")
      .option("writeMethod", "direct")
      .option("writeAtLeastOnce", "true")
      .mode("append")
      .save("project.dataset.biglake_iceberg_table")
)
```

Recommendation:
One additional caveat from the docs: even when streamed writes succeed, Iceberg metadata visibility for open-source engines may lag by up to ~90 minutes, so this should not be treated as immediate cross-engine freshness. Given that, I would not change the plugin implementation to Spark. I would treat Spark/Dataflow as:
After looking at the documented support surface and the current E2E result, my recommendation is to keep this PR as a minimal MVP for BigLake support. Recommended default behavior:
Why I think this is the right scope for this PR:
Why this split makes sense technically:
So for this PR, I would explicitly avoid expanding scope into:
Those can all be follow-up work if needed. For MVP, the cleanest path is:
That gives users a working feature now, keeps the PR minimal, and avoids overfitting to an undocumented / currently failing backend path.
I think this should be tracked as a Google-internal / product bug, separate from this PR. Reason:
I would recommend filing a product bug with a minimal repro like this:

Title:

Repro summary:
Questions for product team:
Given that, I would not block this PR on raw Storage Write API. I would keep the PR minimal and use:
That gives users a working MVP now, while the raw Storage Write API path is tracked as a separate product bug.
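For legacy streaming to work with the flattened STRING schema, each row's complex values have to be JSON-serialized before calling `insert_rows_json`. A minimal stand-alone sketch; the function name and exact conversion rules here are illustrative, not the plugin's actual code:

```python
import datetime
import json


def prepare_rows_json(rows):
    """Make rows safe for BigQuery's legacy streaming insert_rows_json():
    dict/list values become JSON strings, datetimes become ISO-8601 strings.
    Scalar values pass through unchanged.
    """
    prepared = []
    for row in rows:
        out = {}
        for key, value in row.items():
            if isinstance(value, (dict, list)):
                out[key] = json.dumps(value)
            elif isinstance(value, datetime.datetime):
                out[key] = value.isoformat()
            else:
                out[key] = value
        prepared.append(out)
    return prepared
```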
The Storage Write API v2 (Arrow format) cannot write to BigLake Iceberg tables due to internal `_colidentifier_iceberg_1` columns. Route BigLake writes to the legacy streaming API (`insert_rows_json`), which handles these transparently.

- Add LegacyStreamingBatchProcessor with same queue/batch interface
- BigLake: create LegacyStreamingBatchProcessor in `_get_loop_state()`
- Non-BigLake: unchanged, uses Storage Write API (BatchProcessor)
- Skip Arrow schema creation for BigLake (not needed)
- Update `_LoopState` to accept Union processor type
- Add 5 tests for legacy streaming processor and routing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BigLake Iceberg tables cannot handle nested RECORD fields via any streaming API (both Storage Write API and legacy streaming fail with `_colidentifier_iceberg` errors on RECORD positions). Changes:

- `_replace_json_with_string` now also flattens RECORD/STRUCT fields to STRING (JSON-serialized) for BigLake Iceberg
- `LegacyStreamingBatchProcessor._prepare_rows_json` serializes dict/list values to JSON strings
- Updated E2E test scripts to verify flattened schema
- Local E2E test passes: 44 events, all 9 event types, all checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
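A rough sketch of the flattening rule this commit describes, using plain dicts in place of `google.cloud.bigquery.SchemaField` objects; the helper name and field list are illustrative:

```python
def flatten_schema_for_biglake(schema):
    """Replace JSON and RECORD/STRUCT fields with STRING for BigLake Iceberg.

    BigLake Iceberg supports neither the JSON type nor streamed nested
    RECORDs, so both are stored as JSON-serialized STRING columns.
    """
    flattened = []
    for field in schema:
        if field["type"] in ("JSON", "RECORD", "STRUCT"):
            flattened.append({"name": field["name"], "type": "STRING"})
        else:
            flattened.append(dict(field))
    return flattened
```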
Summary
Adds BigLake Iceberg support to the BigQuery Agent Analytics Plugin, allowing users to log agent events to BigLake managed Apache Iceberg tables in BigQuery.
What changed
Schema & Table Creation:
- New `biglake_storage_uri` option on `BigQueryLoggerConfig`: when set (together with `connection_id`), the plugin automatically creates and configures a BigLake Iceberg table
- When `biglake_storage_uri` is set:
  - JSON fields (`content`, `attributes`, `latency_ms`, `content_parts.object_ref.details`) → STRING
  - RECORD fields (`content_parts`, `object_ref`) → STRING (JSON-serialized)
  - `file_format=PARQUET`, `table_format=ICEBERG`, `storage_uri`, and a normalized `connection_id` are set on the BigQuery table
- `connection_id` normalization: accepts 3 formats and normalizes to the full resource path:
  - `location.connection` → `projects/{project}/locations/{location}/connections/{connection}`
  - `project.location.connection` → `projects/{project}/locations/{location}/connections/{connection}`
  - `projects/P/locations/L/connections/C` → used as-is
- Time partitioning is skipped by default for BigLake Iceberg (preview feature); opt in with `biglake_time_partitioning=True`
- Raises `ValueError` if `biglake_storage_uri` is set without `connection_id`

Write Path:

- Non-BigLake tables: `storage_write_api` (unchanged, existing behavior)
- BigLake Iceberg tables: `legacy_streaming` (new `LegacyStreamingBatchProcessor`)
- Rationale: the Storage Write API fails against BigLake Iceberg tables (internal `_colidentifier_iceberg_*` errors). The fix is to flatten all complex types to STRING in the schema, then use legacy streaming, which handles the data correctly.
- `LegacyStreamingBatchProcessor` has the same queue/batch/flush/shutdown interface as `BatchProcessor`, using `client.insert_rows_json()` run in a ThreadPoolExecutor
- `_prepare_rows_json()` serializes dict/list values to JSON strings and datetime objects to ISO format

E2E Test Results

- Local agent test: ALL CHECKS PASSED (flattened STRING columns verified via `PARSE_JSON()`)
- Agent Engine test: expected failure; the remote Agent Engine installs `google-adk[bigquery]` from PyPI (the released version without the BigLake changes). It created a standard BQ table and logged 33 events via the Storage Write API. Will work once these changes are published.

Additional notes

- The pre-built views are not compatible with the flattened BigLake schema (queries need `PARSE_JSON()` on STRING columns instead of `JSON_VALUE()` on JSON columns); `create_views=False` is recommended.

Test plan

- New `TestBigLakeIceberg` test class covering:
  - the `is_biglake` property and `connection_id` validation
  - routing of BigLake writes to `LegacyStreamingBatchProcessor` (not `BatchProcessor`)
  - `_prepare_rows_json` serialization (datetime, dict, list, None)
  - the `insert_rows_json` write path

Related
🤖 Generated with Claude Code