feat(data-drains): add GCS, Azure Blob, BigQuery, Snowflake, and Datadog destinations #4552
PR Summary (Cursor Bugbot) — High Risk

The settings UI is extended with new destination forms/options and icons, and the API contracts/schemas are expanded to support the new destination types with significantly stricter validation (bucket/account/table identifier rules, allowlisted Azure endpoint suffixes, BigQuery/Snowflake identifiers). Separately, S3 and webhook destinations are hardened: S3 gains key/metadata byte-limit checks and stronger endpoint/region/bucket validation; webhook validation tightens (SSRF-safe URL rules, reserved/safer signature-header constraints, a longer signing secret, bearer-token sanitation) and shares common retry/backoff helpers. Docs are updated to describe all supported destinations, and a DB migration adds the new enum values. Reviewed by Cursor Bugbot for commit 58a35bf.
Greptile Summary

This PR adds five new data-drain destinations (GCS, Azure Blob Storage, BigQuery, Snowflake, and Datadog) alongside full API contract schemas, a settings UI, a database migration, and comprehensive unit tests. All destinations implement
Confidence Score: 4/5

Safe to merge after resolving the outstanding Snowflake test() validation gap flagged in the prior review round. All issues identified in previous review rounds appear fixed in the current HEAD. The one remaining open gap is in Snowflake's apps/sim/lib/data-drains/destinations/snowflake.ts.

Important Files Changed
Audited every destination against the live AWS/GCS/Azure/BigQuery/Snowflake/Datadog/webhook docs and applied spec-correctness fixes:

- S3: reserved bucket prefix amzn-s3-demo-, suffixes --x-s3/--table-s3; metadata byte formula excludes the x-amz-meta- prefix per the AWS spec
- GCS: reject -./.- adjacency; UTF-8 prefix cap; forbid the .well-known/acme-challenge/ prefix; ASCII-only x-goog-meta-* enforcement
- BigQuery: insertId is 128 chars (not bytes); split DATASET_RE (ASCII) and TABLE_RE (Unicode L/M/N + connectors); UTF-8 byte cap on tableId
- Snowflake: disambiguate org-account vs legacy locator account formats; requestId + retry=true for idempotent retries; server-side timeout=600; default column DATA uppercase to match the unquoted canonical form
- Azure: endpoint suffix allowlist (4 sovereign clouds); accountKey length(88) base64
- Webhook: url max(2048); CRLF/NUL rejection on bearer/secret/sig header
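For the S3 metadata rule above, a minimal sketch of a byte-accurate guard: AWS caps user-defined metadata at 2 KB, measured as the UTF-8 bytes of keys plus values with the x-amz-meta- prefix excluded. The helper names below are illustrative, not the PR's actual code.

```typescript
const S3_METADATA_LIMIT_BYTES = 2048;

function utf8Bytes(s: string): number {
  return new TextEncoder().encode(s).length;
}

function metadataBytes(metadata: Record<string, string>): number {
  let total = 0;
  for (const [key, value] of Object.entries(metadata)) {
    // Strip the x-amz-meta- prefix before counting, per the AWS spec.
    const bareKey = key.toLowerCase().startsWith("x-amz-meta-")
      ? key.slice("x-amz-meta-".length)
      : key;
    total += utf8Bytes(bareKey) + utf8Bytes(value);
  }
  return total;
}

function assertMetadataWithinLimit(metadata: Record<string, string>): void {
  const bytes = metadataBytes(metadata);
  if (bytes > S3_METADATA_LIMIT_BYTES) {
    throw new Error(`S3 user metadata is ${bytes} bytes; limit is ${S3_METADATA_LIMIT_BYTES}`);
  }
}
```

Counting bytes rather than characters matters for non-ASCII values, where `String.length` undercounts.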
… parsing

- snowflake pollStatement: per-attempt timeout via AbortSignal.any, retry on 429/5xx with Retry-After + jitter
- bigquery parseNdjson error messages are now 1-indexed
- consolidate parseNdjson variants into shared parseNdjsonLines/parseNdjsonObjects in utils
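A shared NDJSON helper with 1-indexed error messages could look roughly like this sketch (the real parseNdjsonObjects in utils may differ in shape):

```typescript
// Parse newline-delimited JSON, reporting human-friendly 1-indexed line numbers.
function parseNdjsonObjects(ndjson: string): unknown[] {
  const objects: unknown[] = [];
  const lines = ndjson.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "") continue; // tolerate blank lines between records
    try {
      objects.push(JSON.parse(line));
    } catch {
      // Humans count lines from 1, so surface i + 1 rather than the array index.
      throw new Error(`Invalid JSON on line ${i + 1}`);
    }
  }
  return objects;
}
```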
…ke poll double-sleep

- gcs.fetchWithRetry + bigquery.postInsertAll now use AbortSignal.any with a per-attempt timeout so a hung TCP connection cannot stall the drain
- snowflake.pollStatement skips the next interval sleep when it just slept for retry backoff
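The per-attempt timeout pattern can be sketched as follows: AbortSignal.any (Node 20+) aborts as soon as either the caller's signal or a fresh per-attempt timeout fires, so one hung attempt cannot outlive its budget. Function names here are assumed for illustration.

```typescript
// Combine the caller's signal with a fresh per-attempt timeout.
function combinedSignal(caller: AbortSignal, perAttemptMs: number): AbortSignal {
  return AbortSignal.any([caller, AbortSignal.timeout(perAttemptMs)]);
}

// One network attempt bounded by both the caller and the per-attempt budget.
async function fetchAttempt(
  url: string,
  callerSignal: AbortSignal,
  perAttemptMs = 30_000,
): Promise<Response> {
  return fetch(url, { signal: combinedSignal(callerSignal, perAttemptMs) });
}
```

Because the timeout signal is created inside the attempt, each retry gets a full budget rather than sharing one overall deadline.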
…owflake column default UI/docs

- bigquery test() probe now uses AbortSignal.any + a per-attempt timeout
- bigquery insertAll retry switches to backoffWithJitter for thundering-herd avoidance
- Snowflake column placeholder + docs say DATA (uppercase) to match the code default
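A full-jitter backoff of the kind the commit calls backoffWithJitter might look like this (the base and cap values are assumptions, not the PR's actual constants):

```typescript
// Full-jitter exponential backoff: grow the ceiling exponentially per attempt,
// then draw uniformly from [0, ceiling) so retrying clients spread out
// instead of synchronizing into a thundering herd.
function backoffWithJitter(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```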
isComplete now requires signingSecret >= 32 to match the contract/runtime schema so the Save button can't enable on a value that will fail server-side.
Switch Snowflake to parseNdjsonObjects so malformed rows are caught locally with 1-indexed line numbers instead of failing the whole INSERT server-side. Re-stringify each parsed object before binding to PARSE_JSON(?). Drop the now-unused parseNdjsonLines helper.
- Azure: bound retryOptions on BlobServiceClient (the SDK's default tryTimeoutInMs is per-try unbounded; cap at 30s x 5 tries)
- Webhook contract: mirror the runtime (signingSecret.max(512), bearerToken.max(4096) + CRLF/NUL refine, signatureHeader charset + CRLF/NUL refine)
- S3 (lib + contract): reject bucket names with a dash adjacent to a dot; require an https:// endpoint at the schema layer
- Snowflake: bind the original NDJSON line bytes (re-stringifying a JSON.parse'd value loses bigint precision beyond 2^53-1); check the pollStatement 200 body for the SQL error envelope (sqlState/code)
- Datadog: the entry builder writes defaults first, then user attrs, then forced ddtags/message so user rows can't clobber routing fields; validate config.tags as comma-separated key:value pairs
- registry.tsx: tighten isComplete predicates to mirror contract minimums (GCS bucket >= 3, Azure containerName >= 3 / accountKey === 88, BigQuery projectId >= 6, Snowflake account >= 3)
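The dash-adjacent-to-dot S3 rule can be sketched as a post-regex check (a hypothetical helper, not the PR's code; only the core length/charset rules from AWS's naming docs are modeled here):

```typescript
// Base S3 bucket charset/length rule: 3-63 chars, lowercase letters,
// digits, dots, and hyphens, starting and ending alphanumeric.
const S3_BUCKET_RE = /^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$/;

function isValidS3BucketName(name: string): boolean {
  if (!S3_BUCKET_RE.test(name)) return false;
  // A dash adjacent to a dot means a label starts or ends with '-',
  // which AWS forbids; it shows up as "-." or ".-" inside the name.
  if (name.includes("-.") || name.includes(".-")) return false;
  if (name.includes("..")) return false; // no empty labels
  return true;
}
```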
The previous fix placed ddsource/service before ...attrs, leaving them clobberable by a user row field. Per the Datadog docs, service + ddsource select the processing pipeline, so a drain's routing config must not be overridable per-row. Spread attrs first, then force all four reserved fields (ddsource, service, ddtags, message).
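The corrected spread order can be illustrated as follows (the builder signature and field values are illustrative, not the PR's actual code):

```typescript
type DatadogEntry = Record<string, unknown>;

function buildEntry(
  attrs: Record<string, unknown>,
  routing: { ddsource: string; service: string; ddtags: string },
  message: string,
): DatadogEntry {
  return {
    ...attrs, // user-controlled row fields go first...
    // ...so the reserved routing fields, written last, win on key collisions.
    ddsource: routing.ddsource,
    service: routing.service,
    ddtags: routing.ddtags,
    message,
  };
}
```

In JavaScript object literals, later keys overwrite earlier ones, so ordering alone guarantees the routing fields cannot be clobbered by a row.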
…ertId overflows

Truncating from the left dropped the index suffix, so any overflow would collapse all rows in a chunk to the same insertId and BigQuery would silently dedupe them. The path is unreachable today (UUIDs keep the raw id at ~85 chars), but the overflow branch is now correct: hash the prefix, keep the index intact.
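The corrected overflow branch can be sketched like this: hash only the prefix, keep the index suffix verbatim, so every row in a chunk keeps a distinct insertId. The 128-char cap matches BigQuery's insertId limit; the helper's shape is an assumption.

```typescript
import { createHash } from "node:crypto";

const INSERT_ID_MAX = 128;

function buildInsertId(prefix: string, index: number): string {
  const raw = `${prefix}-${index}`;
  if (raw.length <= INSERT_ID_MAX) return raw;
  // Overflow: hash the prefix to a fixed 64 hex chars, keep the index intact
  // so rows never collapse onto one insertId and get silently deduped.
  const digest = createHash("sha256").update(prefix).digest("hex");
  return `${digest}-${index}`;
}
```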
- gcs: rebuild the Authorization header per attempt via buildHeaders so a token refresh from google-auth-library kicks in if a 5xx retry crosses the hour-long token lifetime
- azure_blob: pin the account-key regex to {0,2} trailing '=' (base64 of 64 bytes = exactly 88 chars with up to two '=' pad chars)
- gcs: allow 1-char dot-separated bucket components (e.g. "a.bucket") to match GCS naming rules: the overall name is 3-63 chars (up to 222 with dots), but the per-component minimum is 1 per Google's spec
- bigquery: drain the 401 response body before re-issuing the request with a refreshed token so undici can return the socket to the keep-alive pool
- snowflake: hoist getJwt() above the perAttempt timer in executeStatement so JWT signing doesn't eat the network budget (matches the order already used in pollStatement)
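The per-component GCS rule can be sketched as a length-only check (a hypothetical helper; charset rules are omitted for brevity):

```typescript
// GCS bucket-name lengths: 3-63 chars overall, up to 222 when the name
// contains dots, and each dot-separated component is 1-63 chars.
function isValidGcsBucketLength(name: string): boolean {
  if (name.includes(".")) {
    if (name.length < 3 || name.length > 222) return false;
    // Per-component minimum is 1, so "a.bucket" is legal.
    return name.split(".").every((c) => c.length >= 1 && c.length <= 63);
  }
  return name.length >= 3 && name.length <= 63;
}
```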
…suffix

The account validation rejected `<orgname>-<acctname>.<region>.<cloud>` because `ACCOUNT_LOCATOR_RE`'s first segment forbade hyphens, while `ACCOUNT_ORG_RE` forbade dots. `normalizeAccountForJwt` already handles this composite form. Widen the first segment of `ACCOUNT_LOCATOR_RE` to allow hyphens so the boundary contract and the runtime schema accept what the JWT layer was already designed to process.
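A sketch of the widened validation, assuming regexes shaped like the ones named in the commit (the exact charsets are assumptions, not the PR's patterns):

```typescript
// Org-account form: <org>-<account>, no dots.
const ACCOUNT_ORG_RE = /^[A-Za-z0-9]+-[A-Za-z0-9_]+$/;
// Locator form: first segment now allows hyphens, so the composite
// <orgname>-<acctname>.<region>.<cloud> identifier passes.
const ACCOUNT_LOCATOR_RE = /^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$/;

function isValidSnowflakeAccount(account: string): boolean {
  return ACCOUNT_ORG_RE.test(account) || ACCOUNT_LOCATOR_RE.test(account);
}
```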
Mirrors the bigquery 401 fix. Without consuming the body before sleeping, undici can't return the socket to the keep-alive pool, so each retry leaks a TCP connection instead of reusing it.
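The drain-before-retry pattern, as a minimal sketch (the helper name is illustrative): read and discard the body so undici can return the socket to its keep-alive pool before the next attempt.

```typescript
// Consume a response body we don't need, so the underlying TCP socket
// can go back to undici's keep-alive pool instead of leaking per retry.
async function drainBody(res: Response): Promise<void> {
  try {
    await res.arrayBuffer(); // read and discard
  } catch {
    // A failed drain just tears the connection down; the retry proceeds anyway.
  }
}
```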
…atus

Mirrors the bigquery/datadog/gcs drains. Long async statements can poll many times against the same connection; without consuming the body, undici can't return the socket to the keep-alive pool, so each iteration leaks a connection until GC.
… 200

- gcs: drain the body on success paths so undici can return the socket to the keep-alive pool
- snowflake: drain the body on a synchronous 200 OK and run the same sqlState envelope check pollStatement already does; otherwise a statement-level failure that completes synchronously would silently return success
Same undici keep-alive issue as the prior fixes: postWithRetries returned the Response on success without draining (callers only read headers); the BigQuery `test()` probe returned without consuming the body. Both now drain before returning.
✅ Bugbot reviewed your changes and found no new issues!
Summary
- All destinations implement test() + openSession()/deliver() with provider-spec-correct auth, retry, byte-accurate size guards, and abort-signal forwarding
- BigQuery streams via tabledata.insertAll with drainId-prefixed insertId dedup, partial-failure surfacing, 401 token refresh + 5xx/429 retry
- Azure Blob uses the @azure/storage-blob SDK with sovereign-cloud endpointSuffix support
- Docs: enterprise/data-drains.mdx

Type of Change
Testing
Tested manually. New unit tests cover schema validation, retry paths, byte-accurate size guards, gzip, sovereign-cloud routing, and partial-failure handling per destination.
Checklist