feat(data-drains): add GCS, Azure Blob, BigQuery, Snowflake, and Datadog destinations #4552
PR Summary (Cursor Bugbot) — High Risk

The settings UI is extended with new destination forms/options and icons, and the API contracts/schemas are expanded to support the new destination types with significantly stricter validation (bucket/account/table identifier rules, allowlisted Azure endpoint suffixes, BigQuery/Snowflake identifiers). Separately, S3 and webhook destinations are hardened: S3 gains key/metadata byte-limit checks and stronger endpoint/region/bucket validation; webhook validation tightens (SSRF-safe URL rules, reserved/safer signature-header constraints, a longer signing secret, bearer-token sanitation) and shares common retry/backoff helpers. Docs are updated to describe all supported destinations, and a DB migration adds the new enum values. Reviewed by Cursor Bugbot for commit 58a35bf.
Greptile Summary

This PR adds five new data-drain destinations (GCS, Azure Blob Storage, BigQuery, Snowflake, and Datadog) alongside full API contract schemas, a settings UI, a database migration, and comprehensive unit tests. All destinations implement
Confidence Score: 4/5

Safe to merge after resolving the outstanding Snowflake test() validation gap flagged in the prior review round. All issues identified in previous review rounds appear fixed in the current HEAD. The one remaining open gap is in Snowflake's apps/sim/lib/data-drains/destinations/snowflake.ts.

Important Files Changed
Audited every destination against the live AWS/GCS/Azure/BigQuery/Snowflake/Datadog/webhook docs and applied spec-correctness fixes:

- S3: reserved bucket prefix amzn-s3-demo-, suffixes --x-s3/--table-s3; metadata byte formula excludes the x-amz-meta- prefix per the AWS spec
- GCS: reject -./.- adjacency; UTF-8 prefix cap; forbid the .well-known/acme-challenge/ prefix; ASCII-only x-goog-meta-* enforcement
- BigQuery: insertId is 128 chars (not bytes); split DATASET_RE (ASCII) and TABLE_RE (Unicode L/M/N + connectors); UTF-8 byte cap on tableId
- Snowflake: disambiguate org-account vs legacy locator account formats; requestId + retry=true for idempotent retries; server-side timeout=600; default column DATA uppercase to match the unquoted canonical form
- Azure: endpoint suffix allowlist (4 sovereign clouds); accountKey length(88) base64
- Webhook: url max(2048); CRLF/NUL rejection on bearer/secret/sig header
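For the S3 metadata rule above, a minimal sketch of a byte-accurate guard: AWS caps user-defined metadata at 2 KB, measured as the UTF-8 bytes of keys plus values with the x-amz-meta- prefix excluded. The helper names below are illustrative, not the PR's actual code.

```typescript
const S3_METADATA_LIMIT_BYTES = 2048;

function utf8Bytes(s: string): number {
  return new TextEncoder().encode(s).length;
}

function metadataBytes(metadata: Record<string, string>): number {
  let total = 0;
  for (const [key, value] of Object.entries(metadata)) {
    // Strip the x-amz-meta- prefix before counting, per the AWS spec.
    const bareKey = key.toLowerCase().startsWith("x-amz-meta-")
      ? key.slice("x-amz-meta-".length)
      : key;
    total += utf8Bytes(bareKey) + utf8Bytes(value);
  }
  return total;
}

function assertMetadataWithinLimit(metadata: Record<string, string>): void {
  const bytes = metadataBytes(metadata);
  if (bytes > S3_METADATA_LIMIT_BYTES) {
    throw new Error(`S3 user metadata is ${bytes} bytes; limit is ${S3_METADATA_LIMIT_BYTES}`);
  }
}
```

Counting bytes rather than characters matters for non-ASCII values, where `String.length` undercounts.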
… parsing

- snowflake pollStatement: per-attempt timeout via AbortSignal.any, retry on 429/5xx with Retry-After + jitter
- bigquery parseNdjson error messages are now 1-indexed
- consolidate parseNdjson variants into shared parseNdjsonLines/parseNdjsonObjects in utils
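A shared NDJSON helper with 1-indexed error messages could look roughly like this sketch (the real parseNdjsonObjects in utils may differ in shape):

```typescript
// Parse newline-delimited JSON, reporting human-friendly 1-indexed line numbers.
function parseNdjsonObjects(ndjson: string): unknown[] {
  const objects: unknown[] = [];
  const lines = ndjson.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "") continue; // tolerate blank lines between records
    try {
      objects.push(JSON.parse(line));
    } catch {
      // Humans count lines from 1, so surface i + 1 rather than the array index.
      throw new Error(`Invalid JSON on line ${i + 1}`);
    }
  }
  return objects;
}
```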
…ke poll double-sleep

- gcs.fetchWithRetry + bigquery.postInsertAll now use AbortSignal.any with a per-attempt timeout so a hung TCP connection cannot stall the drain
- snowflake.pollStatement skips the next interval sleep when it just slept for retry backoff
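The per-attempt timeout pattern can be sketched as follows: AbortSignal.any (Node 20+) aborts as soon as either the caller's signal or a fresh per-attempt timeout fires, so one hung attempt cannot outlive its budget. Function names here are assumed for illustration.

```typescript
// Combine the caller's signal with a fresh per-attempt timeout.
function combinedSignal(caller: AbortSignal, perAttemptMs: number): AbortSignal {
  return AbortSignal.any([caller, AbortSignal.timeout(perAttemptMs)]);
}

// One network attempt bounded by both the caller and the per-attempt budget.
async function fetchAttempt(
  url: string,
  callerSignal: AbortSignal,
  perAttemptMs = 30_000,
): Promise<Response> {
  return fetch(url, { signal: combinedSignal(callerSignal, perAttemptMs) });
}
```

Because the timeout signal is created inside the attempt, each retry gets a full budget rather than sharing one overall deadline.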
…owflake column default UI/docs

- bigquery test() probe now uses AbortSignal.any + a per-attempt timeout
- bigquery insertAll retry switches to backoffWithJitter for thundering-herd avoidance
- Snowflake column placeholder + docs say DATA (uppercase) to match the code default
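A full-jitter backoff of the kind the commit calls backoffWithJitter might look like this (the base and cap values are assumptions, not the PR's actual constants):

```typescript
// Full-jitter exponential backoff: grow the ceiling exponentially per attempt,
// then draw uniformly from [0, ceiling) so retrying clients spread out
// instead of synchronizing into a thundering herd.
function backoffWithJitter(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```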
isComplete now requires signingSecret >= 32 to match the contract/runtime schema so the Save button can't enable on a value that will fail server-side.
Switch Snowflake to parseNdjsonObjects so malformed rows are caught locally with 1-indexed line numbers instead of failing the whole INSERT server-side. Re-stringify each parsed object before binding to PARSE_JSON(?). Drop the now-unused parseNdjsonLines helper.
- Azure: bound retryOptions on BlobServiceClient (the SDK's default tryTimeoutInMs is per-try unbounded; cap at 30s x 5 tries)
- Webhook contract: mirror the runtime (signingSecret.max(512), bearerToken.max(4096) + CRLF/NUL refine, signatureHeader charset + CRLF/NUL refine)
- S3 (lib + contract): reject bucket names with a dash adjacent to a dot; require an https:// endpoint at the schema layer
- Snowflake: bind the original NDJSON line bytes (re-stringifying a JSON.parse'd value loses bigint precision beyond 2^53-1); check the pollStatement 200 body for the SQL error envelope (sqlState/code)
- Datadog: the entry builder writes defaults first, then user attrs, then forced ddtags/message so user rows can't clobber routing fields; validate config.tags as comma-separated key:value pairs
- registry.tsx: tighten isComplete predicates to mirror contract minimums (GCS bucket >= 3, Azure containerName >= 3 / accountKey === 88, BigQuery projectId >= 6, Snowflake account >= 3)
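The dash-adjacent-to-dot S3 rule can be sketched as a post-regex check (a hypothetical helper, not the PR's code; only the core length/charset rules from AWS's naming docs are modeled here):

```typescript
// Base S3 bucket charset/length rule: 3-63 chars, lowercase letters,
// digits, dots, and hyphens, starting and ending alphanumeric.
const S3_BUCKET_RE = /^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$/;

function isValidS3BucketName(name: string): boolean {
  if (!S3_BUCKET_RE.test(name)) return false;
  // A dash adjacent to a dot means a label starts or ends with '-',
  // which AWS forbids; it shows up as "-." or ".-" inside the name.
  if (name.includes("-.") || name.includes(".-")) return false;
  if (name.includes("..")) return false; // no empty labels
  return true;
}
```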
The previous fix placed ddsource/service before ...attrs, leaving them clobberable by a user row field. Per the Datadog docs, service + ddsource select the processing pipeline, so a drain's routing config must not be overridable per-row. Spread attrs first, then force all four reserved fields (ddsource, service, ddtags, message).
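The corrected spread order can be illustrated as follows (the builder signature and field values are illustrative, not the PR's actual code):

```typescript
type DatadogEntry = Record<string, unknown>;

function buildEntry(
  attrs: Record<string, unknown>,
  routing: { ddsource: string; service: string; ddtags: string },
  message: string,
): DatadogEntry {
  return {
    ...attrs, // user-controlled row fields go first...
    // ...so the reserved routing fields, written last, win on key collisions.
    ddsource: routing.ddsource,
    service: routing.service,
    ddtags: routing.ddtags,
    message,
  };
}
```

In JavaScript object literals, later keys overwrite earlier ones, so ordering alone guarantees the routing fields cannot be clobbered by a row.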
…ertId overflows

Truncating from the left dropped the index suffix, so any overflow would collapse all rows in a chunk to the same insertId and BigQuery would silently dedupe them. The path is unreachable today (UUIDs keep the raw id at ~85 chars), but the overflow branch is now correct: hash the prefix, keep the index intact.
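The corrected overflow branch can be sketched like this: hash only the prefix, keep the index suffix verbatim, so every row in a chunk keeps a distinct insertId. The 128-char cap matches BigQuery's insertId limit; the helper's shape is an assumption.

```typescript
import { createHash } from "node:crypto";

const INSERT_ID_MAX = 128;

function buildInsertId(prefix: string, index: number): string {
  const raw = `${prefix}-${index}`;
  if (raw.length <= INSERT_ID_MAX) return raw;
  // Overflow: hash the prefix to a fixed 64 hex chars, keep the index intact
  // so rows never collapse onto one insertId and get silently deduped.
  const digest = createHash("sha256").update(prefix).digest("hex");
  return `${digest}-${index}`;
}
```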
- gcs: rebuild the Authorization header per attempt via buildHeaders so a token refresh from google-auth-library kicks in if a 5xx retry crosses the hour-long token lifetime
- azure_blob: pin the account-key regex to {0,2} trailing '=' (base64 of 64 bytes = exactly 88 chars with up to two '=' pad chars)
- gcs: allow 1-char dot-separated bucket components (e.g. "a.bucket") to match GCS naming rules: the overall name is 3-63 chars (up to 222 with dots), but the per-component minimum is 1 per Google's spec
- bigquery: drain the 401 response body before re-issuing the request with a refreshed token so undici can return the socket to the keep-alive pool
- snowflake: hoist getJwt() above the perAttempt timer in executeStatement so JWT signing doesn't eat the network budget (matches the order already used in pollStatement)
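The per-component GCS rule can be sketched as a length-only check (a hypothetical helper; charset rules are omitted for brevity):

```typescript
// GCS bucket-name lengths: 3-63 chars overall, up to 222 when the name
// contains dots, and each dot-separated component is 1-63 chars.
function isValidGcsBucketLength(name: string): boolean {
  if (name.includes(".")) {
    if (name.length < 3 || name.length > 222) return false;
    // Per-component minimum is 1, so "a.bucket" is legal.
    return name.split(".").every((c) => c.length >= 1 && c.length <= 63);
  }
  return name.length >= 3 && name.length <= 63;
}
```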
…suffix

The account validation rejected `<orgname>-<acctname>.<region>.<cloud>` because `ACCOUNT_LOCATOR_RE`'s first segment forbade hyphens, while `ACCOUNT_ORG_RE` forbade dots. `normalizeAccountForJwt` already handles this composite form. Widen the first segment of `ACCOUNT_LOCATOR_RE` to allow hyphens so the boundary contract and the runtime schema accept what the JWT layer was already designed to process.
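A sketch of the widened validation, assuming regexes shaped like the ones named in the commit (the exact charsets are assumptions, not the PR's patterns):

```typescript
// Org-account form: <org>-<account>, no dots.
const ACCOUNT_ORG_RE = /^[A-Za-z0-9]+-[A-Za-z0-9_]+$/;
// Locator form: first segment now allows hyphens, so the composite
// <orgname>-<acctname>.<region>.<cloud> identifier passes.
const ACCOUNT_LOCATOR_RE = /^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$/;

function isValidSnowflakeAccount(account: string): boolean {
  return ACCOUNT_ORG_RE.test(account) || ACCOUNT_LOCATOR_RE.test(account);
}
```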
Mirrors the bigquery 401 fix. Without consuming the body before sleeping, undici can't return the socket to the keep-alive pool, so each retry leaks a TCP connection instead of reusing it.
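The drain-before-retry pattern, as a minimal sketch (the helper name is illustrative): read and discard the body so undici can return the socket to its keep-alive pool before the next attempt.

```typescript
// Consume a response body we don't need, so the underlying TCP socket
// can go back to undici's keep-alive pool instead of leaking per retry.
async function drainBody(res: Response): Promise<void> {
  try {
    await res.arrayBuffer(); // read and discard
  } catch {
    // A failed drain just tears the connection down; the retry proceeds anyway.
  }
}
```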
…atus

Mirrors the bigquery/datadog/gcs drains. Long async statements can poll many times against the same connection; without consuming the body, undici can't return the socket to the keep-alive pool, so each iteration leaks a connection until GC.
… 200

- gcs: drain the body on success paths so undici can return the socket to the keep-alive pool
- snowflake: drain the body on a synchronous 200 OK and run the same sqlState envelope check pollStatement already does; otherwise a statement-level failure that completes synchronously would silently return success
Same undici keep-alive issue as the prior fixes: postWithRetries returned the Response on success without draining (callers only read headers); the BigQuery `test()` probe returned without consuming the body. Both now drain before returning.
✅ Bugbot reviewed your changes and found no new issues!
Summary
- All destinations implement test() + openSession()/deliver() with provider-spec-correct auth, retry, byte-accurate size guards, and abort-signal forwarding
- BigQuery streams via tabledata.insertAll with drainId-prefixed insertId dedup, partial-failure surfacing, 401 token refresh + 5xx/429 retry
- Azure Blob uses the @azure/storage-blob SDK with sovereign-cloud endpointSuffix support
- Docs: enterprise/data-drains.mdx

Type of Change
Testing
Tested manually. New unit tests cover schema validation, retry paths, byte-accurate size guards, gzip, sovereign-cloud routing, and partial-failure handling per destination.
Checklist