Commit 39f74aa
feat(data-drains): add GCS, Azure Blob, BigQuery, Snowflake, and Datadog destinations (#4552)
* feat(data-drains): add GCS, Azure Blob, BigQuery, Snowflake, and Datadog destinations
* fix(data-drains): address PR review comments
* fix(data-drains): extract sleepUntilAborted, honor abort across all destinations
* fix(data-drains): widen BigQuery projectId max and dedupe parseServiceAccount
* fix(data-drains): tighten GCS bucket contract and expose Azure endpointSuffix
* improvement(data-drains): extract normalizePrefix and buildObjectKey to shared utils
* fix(data-drains): retry BigQuery network errors; tighten Azure accountKey contract
- BigQuery insertAll now wraps the fetch in try/catch inside the retry loop so DNS failures, socket resets, and timeouts are retried with backoff instead of propagating immediately.
- Align azureBlobCredentialsBodySchema with the runtime schema (min 64 / max 120 / base64 regex) so obviously invalid keys are rejected at the API boundary rather than at drain-run time.
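A minimal sketch of the retry shape described above, assuming illustrative names (postWithRetry, MAX_ATTEMPTS, backoffMs) rather than the repo's actual helpers:

```ts
const MAX_ATTEMPTS = 5
const backoffMs = (attempt: number) => Math.min(500 * 2 ** attempt, 30_000)

async function postWithRetry(url: string, init: RequestInit): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    try {
      // Network-level failures (DNS, socket reset, timeout) throw from
      // fetch instead of returning a Response, so they must be caught
      // here to be retried with backoff.
      return await fetch(url, init)
    } catch (err) {
      if (attempt + 1 >= MAX_ATTEMPTS) throw err
      await new Promise((resolve) => setTimeout(resolve, backoffMs(attempt)))
    }
  }
}
```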
* improvement(data-drains): consolidate parseRetryAfter; add Datadog NDJSON line context
- Extract a single parseRetryAfter helper (capped at 30s, returns number | null) into lib/data-drains/destinations/utils.ts and remove the five local copies in bigquery, datadog, gcs, snowflake, and webhook.
- Datadog parseNdjson now wraps JSON.parse in try/catch and surfaces the failing line index, matching BigQuery's parser.
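The consolidated helper's contract, roughly sketched (the HTTP-date branch and the exact shape of the utils.ts version are assumptions):

```ts
const RETRY_AFTER_CAP_MS = 30_000

// Returns a capped delay in milliseconds, or null when the header is
// absent or unparseable. Retry-After may be delta-seconds or an HTTP date.
function parseRetryAfter(res: Response): number | null {
  const raw = res.headers.get('retry-after')
  if (!raw) return null
  const seconds = Number(raw)
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, RETRY_AFTER_CAP_MS)
  }
  const untilDateMs = Date.parse(raw) - Date.now()
  return Number.isFinite(untilDateMs) && untilDateMs > 0
    ? Math.min(untilDateMs, RETRY_AFTER_CAP_MS)
    : null
}
```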
* fix(data-drains): correct Datadog size guard and Snowflake VARIANT limit
- Datadog payload guard now checks the uncompressed size against the 5 MB limit and the wire size against the 6 MB compressed limit, so gzip cannot smuggle an oversized body past the client-side check.
- Snowflake's VARIANT limit is 16 MiB (16,777,216 bytes), not 16,000,000 bytes — payloads in the narrow band between 16 MB and 16 MiB were being rejected unnecessarily.
- Drop the unused apiKey field on Datadog PostInput; the key is already embedded in the prepared request headers.
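A sketch of the corrected double guard using the limits stated above; the names and the choice of binary megabytes are assumptions:

```ts
import { gzipSync } from 'node:zlib'

const MAX_UNCOMPRESSED_BYTES = 5 * 1024 * 1024
const MAX_WIRE_BYTES = 6 * 1024 * 1024

function buildWireBody(json: string): Buffer {
  const raw = Buffer.from(json, 'utf8')
  // Check the uncompressed size first: a highly compressible body must
  // not sneak past a wire-size-only check.
  if (raw.byteLength > MAX_UNCOMPRESSED_BYTES) {
    throw new Error(`payload is ${raw.byteLength} bytes uncompressed (limit 5 MB)`)
  }
  const wire = gzipSync(raw)
  if (wire.byteLength > MAX_WIRE_BYTES) {
    throw new Error(`payload is ${wire.byteLength} bytes compressed (limit 6 MB)`)
  }
  return wire
}
```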
* improvement(data-drains): consolidate backoffWithJitter into shared utils
Datadog, GCS, and webhook each had byte-identical backoff helpers (BASE 500ms, MAX 30s, jitter ±20%, Retry-After floor). Lift the helper into lib/data-drains/destinations/utils.ts alongside parseRetryAfter and sleepUntilAborted, and drop the per-file copies and their BASE_BACKOFF_MS/MAX_BACKOFF_MS constants.
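A sketch matching the parameters listed above (BASE_BACKOFF_MS and MAX_BACKOFF_MS are named in this commit; the function body is illustrative):

```ts
const BASE_BACKOFF_MS = 500
const MAX_BACKOFF_MS = 30_000

function backoffWithJitter(attempt: number, retryAfterMs: number | null = null): number {
  const exponential = Math.min(BASE_BACKOFF_MS * 2 ** attempt, MAX_BACKOFF_MS)
  const jittered = exponential * (0.8 + Math.random() * 0.4) // ±20% jitter
  // A server-sent Retry-After acts as a floor, never shortened by jitter.
  return Math.max(jittered, retryAfterMs ?? 0)
}
```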
* fix(data-drains): align destinations with live provider specs
Audited every destination against live AWS/GCS/Azure/BigQuery/Snowflake/Datadog/webhook docs and applied spec-correctness fixes:
- S3: reserved bucket prefix amzn-s3-demo-, suffixes --x-s3/--table-s3; metadata byte formula excludes the x-amz-meta- prefix per AWS spec
- GCS: reject "-." / ".-" adjacency in bucket names; UTF-8 prefix cap; forbid the .well-known/acme-challenge/ prefix; ASCII-only x-goog-meta-* enforcement
- BigQuery: insertId is 128 chars (not bytes); split DATASET_RE (ASCII) and TABLE_RE (Unicode L/M/N + connectors); UTF-8 byte cap on tableId
- Snowflake: disambiguate org-account vs legacy locator account formats; requestId + retry=true for idempotent retries; server-side timeout=600; default column DATA uppercase to match the unquoted canonical form
- Azure: endpoint suffix allowlist (4 sovereign clouds); accountKey length(88) base64
- Webhook: url max(2048); CRLF/NUL rejection on bearer/secret/signature headers
* fix(data-drains): address PR review on snowflake poll + shared NDJSON parsing
- snowflake pollStatement: per-attempt timeout via AbortSignal.any, retry on 429/5xx with Retry-After + jitter
- bigquery parseNdjson error messages now 1-indexed
- consolidate parseNdjson variants into shared parseNdjsonLines/parseNdjsonObjects in utils
* fix(data-drains): per-attempt fetch timeouts in gcs/bigquery, snowflake poll double-sleep
- gcs.fetchWithRetry + bigquery.postInsertAll now use AbortSignal.any with a per-attempt timeout so a hung TCP connection cannot stall the drain
- snowflake.pollStatement skips the next interval sleep when it just slept for retry backoff
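The per-attempt timeout pattern, sketched with assumed names and an assumed 15 s budget:

```ts
// Combine the drain's long-lived abort signal with a fresh timer per
// attempt: a hung TCP connection fails this attempt quickly without
// consuming the whole run's cancellation budget.
async function fetchWithAttemptTimeout(
  url: string,
  init: RequestInit,
  outer: AbortSignal,
  perAttemptMs = 15_000
): Promise<Response> {
  const signal = AbortSignal.any([outer, AbortSignal.timeout(perAttemptMs)])
  return fetch(url, { ...init, signal })
}
```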
* fix(data-drains): bigquery probe timeout + jittered retries, align Snowflake column default UI/docs
- bigquery test() probe now uses AbortSignal.any + per-attempt timeout
- bigquery insertAll retry switches to backoffWithJitter for thundering-herd avoidance
- Snowflake column placeholder + docs say DATA (uppercase) to match the code default
* fix(data-drains): mirror webhook signingSecret min length in form gate
isComplete now requires signingSecret >= 32 to match the contract/runtime
schema so the Save button can't enable on a value that will fail server-side.
* fix(data-drains): validate JSON client-side for Snowflake before binding
Switch Snowflake to parseNdjsonObjects so malformed rows are caught locally
with 1-indexed line numbers instead of failing the whole INSERT server-side.
Re-stringify each parsed object before binding to PARSE_JSON(?).
Drop the now-unused parseNdjsonLines helper.
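Rough shape of the shared parser (the utils.ts version may differ, e.g. in how blank lines are treated):

```ts
function parseNdjsonObjects(ndjson: string): unknown[] {
  const objects: unknown[] = []
  const lines = ndjson.split('\n')
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim()
    if (line.length === 0) continue // tolerate trailing/blank lines
    try {
      objects.push(JSON.parse(line))
    } catch {
      throw new Error(`invalid JSON on line ${i + 1}`) // 1-indexed
    }
  }
  return objects
}
```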
* fix(data-drains): cross-cutting audit pass against live provider docs
- Azure: bound retryOptions on BlobServiceClient (SDK default tryTimeoutInMs is per-try unbounded; cap at 30s x 5 tries)
- Webhook contract: mirror runtime — signingSecret.max(512), bearerToken.max(4096) + CRLF/NUL refine, signatureHeader charset + CRLF/NUL refine
- S3 (lib + contract): reject bucket names with dash adjacent to dot; require https:// endpoint at the schema layer
- Snowflake: bind original NDJSON line bytes (re-stringifying a JSON.parse'd value loses bigint precision beyond 2^53-1); check pollStatement 200 body for the SQL error envelope (sqlState/code)
- Datadog: entry builder writes defaults first then user attrs then forced ddtags/message so user rows can't clobber routing fields; validate config.tags as comma-separated key:value pairs
- registry.tsx: tighten isComplete predicates to mirror contract minimums (GCS bucket >= 3, Azure containerName >= 3 / accountKey === 88, BigQuery projectId >= 6, Snowflake account >= 3)
* fix(data-drains): force ddsource/service overrides on Datadog entries
Previous fix placed ddsource/service before ...attrs, leaving them clobberable
by a user row field. Per Datadog docs, service + ddsource pick the processing
pipeline, so a drain's routing config must not be overridable per-row. Spread
attrs first, then force all four reserved fields (ddsource, service, ddtags,
message).
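The fix is the spread order, sketched here with an assumed config shape (ddsource, service, ddtags, and message are Datadog's reserved fields):

```ts
function buildEntry(
  attrs: Record<string, unknown>,
  config: { source: string; service: string; tags: string },
  message: string
): Record<string, unknown> {
  return {
    ...attrs, // user row fields spread first...
    ddsource: config.source, // ...so the reserved routing fields win
    service: config.service,
    ddtags: config.tags,
    message,
  }
}
```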
* fix(data-drains): preserve row-distinguishing index when BigQuery insertId overflows
Truncating from the left dropped the index suffix, so any overflow would
collapse all rows in a chunk to the same insertId and BigQuery would silently
dedupe them. Path is unreachable today (UUIDs keep raw ~85 chars), but the
overflow branch is now correct: hash the prefix, keep the index intact.
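A sketch of the corrected overflow branch (the 128-char cap is BigQuery's insertId limit; the helper name and the choice of SHA-256 are assumptions):

```ts
import { createHash } from 'node:crypto'

const INSERT_ID_MAX_CHARS = 128

function buildInsertId(prefix: string, rowIndex: number): string {
  const id = `${prefix}-${rowIndex}`
  if (id.length <= INSERT_ID_MAX_CHARS) return id
  // Left-truncation would drop rowIndex and collapse the chunk to one
  // insertId; hashing the prefix keeps it short while the index stays intact.
  const digest = createHash('sha256').update(prefix).digest('hex') // 64 hex chars
  return `${digest}-${rowIndex}`
}
```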
* fix(data-drains): refresh GCS token per retry, tighten Azure key regex
- gcs: rebuild Authorization header per attempt via buildHeaders so token
refresh from google-auth-library kicks in if a 5xx retry crosses the
hour-long token lifetime
- azure_blob: pin account-key regex to {0,2} trailing '=' (base64 of 64
bytes = exactly 88 chars with up to two '=' pad chars)
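An illustrative composite of the checks described, assumed to be paired with an exact length(88) rule; the real pattern may be written differently:

```ts
// Base64 alphabet, at most two trailing '=' pads; combined with a
// length === 88 check this admits a standard 64-byte account key.
const AZURE_ACCOUNT_KEY_RE = /^[A-Za-z0-9+/]{86,88}={0,2}$/
```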
* fix(data-drains): address bugbot review of 6336948
- gcs: allow 1-char dot-separated bucket components (e.g. "a.bucket")
to match GCS naming rules — overall name is 3-63 (or up to 222 with
dots), but per-component minimum is 1 per Google's spec
- bigquery: drain the 401 response body before re-issuing the request
with a refreshed token so undici can return the socket to the
keep-alive pool
- snowflake: hoist getJwt() above the perAttempt timer in
executeStatement so JWT signing doesn't eat the network budget
(matches the order already used in pollStatement)
* fix(data-drains): allow org-account Snowflake identifier with region suffix
The account validation rejected `<orgname>-<acctname>.<region>.<cloud>`
because `ACCOUNT_LOCATOR_RE`'s first segment forbade hyphens, while
`ACCOUNT_ORG_RE` forbade dots. `normalizeAccountForJwt` already handles
this composite form. Widen the first segment of `ACCOUNT_LOCATOR_RE` to
allow hyphens so the boundary contract and the runtime schema accept
what the JWT layer was already designed to process.
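Illustrative shape only; the real ACCOUNT_LOCATOR_RE lives in the contract and may differ:

```ts
// First segment now admits hyphens, so the composite
// <orgname>-<acctname>.<region>.<cloud> form validates.
const ACCOUNT_LOCATOR_RE = /^[A-Za-z0-9][A-Za-z0-9-]*(?:\.[A-Za-z0-9-]+)*$/
```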
* fix(data-drains): drain retryable response bodies in datadog/gcs loops
Mirrors the bigquery 401 fix. Without consuming the body before
sleeping, undici can't return the socket to the keep-alive pool, so
each retry leaks a TCP connection instead of reusing it.
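The drain-before-retry pattern shared across these fixes, sketched (the helper name is an assumption):

```ts
async function drainBody(res: Response): Promise<void> {
  try {
    // Consuming the body lets undici return the socket to its
    // keep-alive pool instead of leaking one TCP connection per retry.
    await res.arrayBuffer()
  } catch {
    // best effort: a failed drain costs at most one pooled socket
  }
}
```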
* fix(data-drains): drain snowflake poll bodies on 202 and retryable status
Mirrors the bigquery/datadog/gcs drains. Long async statements can poll
many times against the same connection; without consuming the body
undici can't return the socket to the keep-alive pool, so each iteration
leaks a connection until GC.
* fix(data-drains): consume success bodies; check Snowflake sqlState on 200
- gcs: drain the body on success paths so undici can return the socket
to the keep-alive pool
- snowflake: drain the body on synchronous 200 OK and run the same
sqlState envelope check pollStatement already does — otherwise a
statement-level failure that completes synchronously would silently
return success
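A sketch of that shared envelope check, assuming failures surface as sqlState/code fields in an otherwise-200 body (field and helper names are illustrative):

```ts
interface StatementEnvelope {
  sqlState?: string
  code?: string
  message?: string
}

function assertStatementOk(body: StatementEnvelope): void {
  // '00000' is the SQLSTATE success class; anything else on a 200 is a
  // statement-level failure that must not be reported as success.
  if (body.sqlState && body.sqlState !== '00000') {
    throw new Error(
      `Snowflake statement failed (${body.code ?? '?'}/${body.sqlState}): ${body.message ?? ''}`
    )
  }
}
```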
* fix(data-drains): drain datadog and bigquery probe success bodies
Same undici keep-alive issue as the prior fixes: postWithRetries
returned the Response on success without draining (callers only read
headers); the BigQuery `test()` probe returned without consuming the
body. Both now drain before returning.
* chore(data-drains): regenerate enum migration as 0206 after staging rebase
* fix(data-drains): cap snowflake poll retries; tighten datadog tags min length
24 files changed: 19,784 additions & 151 deletions