Skip to content

[PECOBLR-1381][PECOBLR-1382] Implement telemetry Phase 6-7: Collection, Aggregation & Driver Integration#320

Open
samikshya-db wants to merge 15 commits intomainfrom
stack/PECOBLR-1381-1382-telemetry-phase6-7
Open

[PECOBLR-1381][PECOBLR-1382] Implement telemetry Phase 6-7: Collection, Aggregation & Driver Integration#320
samikshya-db wants to merge 15 commits intomainfrom
stack/PECOBLR-1381-1382-telemetry-phase6-7

Conversation

@samikshya-db
Copy link
Collaborator

Summary

This stacked PR builds on #319 and implements Phases 6-7 of the telemetry system, completing the full pipeline.

Stack: Part 2 of 2


Phase 6: Metric Collection & Aggregation ✅

New Files

errors.go (108 lines)

  • isTerminalError() - Non-retryable error detection
  • classifyError() - Error categorization
  • ✅ HTTP error handling utilities

interceptor.go (146 lines)

  • BeforeExecute() / AfterExecute() hooks
  • ✅ Context-based metric tracking
  • ✅ Latency measurement
  • ✅ Tag collection
  • ✅ Error swallowing

aggregator.go (242 lines)

  • ✅ Statement-level aggregation
  • ✅ Batch processing (size: 100)
  • ✅ Background flush (interval: 5s)
  • ✅ Thread-safe operations
  • ✅ Immediate flush for terminal errors

client.go (updated)

  • ✅ Full pipeline integration
  • ✅ Graceful shutdown

Phase 7: Driver Integration ✅

Configuration Support

internal/config/config.go (+18 lines)

  • EnableTelemetry field
  • ForceEnableTelemetry field
  • ✅ DSN parameter parsing
  • DeepCopy() support

Connection Integration

connection.go, connector.go (+20 lines)

  • ✅ Telemetry field in conn struct
  • ✅ Initialization in Connect()
  • ✅ Cleanup in Close()

Helper Module

driver_integration.go (59 lines)

  • InitializeForConnection() - Setup
  • ReleaseForConnection() - Cleanup
  • ✅ Feature flag checking
  • ✅ Resource management

Integration Flow

DSN: "host:port/path?enableTelemetry=true"
    ↓
connector.Connect()
    ↓
telemetry.InitializeForConnection()
    ├─→ Feature flag check (5-level priority)
    ├─→ Get/Create telemetryClient (per host)
    └─→ Create Interceptor (per connection)
    ↓
conn.telemetry = Interceptor
    ↓
conn.Close()
    ├─→ Flush pending metrics
    └─→ Release resources

Changes

Total: +1,073 insertions, -48 deletions (13 files)

Phase 6:

  • telemetry/errors.go (108 lines) - NEW
  • telemetry/interceptor.go (146 lines) - NEW
  • telemetry/aggregator.go (242 lines) - NEW
  • telemetry/client.go (+27/-9) - MODIFIED

Phase 7:

  • telemetry/driver_integration.go (59 lines) - NEW
  • internal/config/config.go (+18) - MODIFIED
  • connection.go (+10) - MODIFIED
  • connector.go (+10) - MODIFIED
  • telemetry/DESIGN.md - MODIFIED

Testing

All tests passing

  • ✅ 70+ telemetry tests (2.018s)
  • ✅ No breaking changes
  • ✅ Compilation verified
  • ✅ Thread-safety verified

Usage Example

// Enable telemetry via DSN
dsn := "host:443/sql/1.0?enableTelemetry=true"
db, _ := sql.Open("databricks", dsn)

// Or force enable
dsn := "host:443/sql/1.0?forceEnableTelemetry=true"

Related Issues


Checklist

  • Phase 6: Collection & aggregation
  • Phase 7: Driver integration
  • Configuration support
  • Resource management
  • All tests passing
  • No breaking changes
  • DESIGN.md updated

…nd opt-in configuration

This commit implements the remaining components for PECOBLR-1143 (Phases 4-5):

Phase 4: Export Infrastructure
- Implement telemetryExporter with HTTP POST to /api/2.0/telemetry-ext
- Add retry logic with exponential backoff (100ms base, 3 retries)
- Integrate with circuit breaker for endpoint protection
- Implement tag filtering via shouldExportToDatabricks()
- Add error swallowing to ensure telemetry never impacts driver
- Support both http:// and https:// URLs for testing

Phase 5: Opt-In Configuration Integration
- Implement isTelemetryEnabled() with 5-level priority logic:
  1. forceEnableTelemetry=true - bypasses all server checks
  2. enableTelemetry=false - explicit opt-out
  3. enableTelemetry=true + server flag - user opt-in with server control
  4. Server flag only - default Databricks-controlled behavior
  5. Default disabled - fail-safe default
- Wire up with existing featureFlagCache for server flag checks
- Handle errors gracefully (default to disabled on failures)

Testing:
- Add 17 comprehensive unit tests for exporter (success, retries, circuit breaker, tag filtering, error swallowing, exponential backoff, context cancellation)
- Add 8 unit tests for isTelemetryEnabled (all 5 priority levels, error handling, server scenarios)
- All 70+ telemetry tests passing

Documentation:
- Update DESIGN.md checklist to mark Phases 3-5 as completed

This completes the core telemetry infrastructure for PECOBLR-1143.
Next phases (6-7) will add metric collection and driver integration.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
samikshya-db and others added 13 commits February 4, 2026 00:39
Address PR review comments by implementing config overlay pattern:

**Changes:**
- Remove ForceEnableTelemetry flag (no longer needed)
- Change EnableTelemetry from bool to *bool pointer
  - nil = use server feature flag (default)
  - true = client forces enabled (overrides server)
  - false = client forces disabled (overrides server)
- Simplify priority logic to 3 clear layers:
  1. Client Config (explicit) - highest priority
  2. Server Config (feature flag) - when client doesn't set
  3. Fail-Safe Default (disabled) - when server unavailable

**Benefits:**
- Simpler mental model: client > server > default
- Client config naturally has priority (no special bypass flag)
- Foundation for general config overlay system
- All 89 telemetry tests passing

Addresses review comments:
- "if we take the config overlay approach that client side config take
  priority? we won't need the concept of force enablement?"
- Sets foundation for extending to all driver configs (not just telemetry)
…sibility

Config Overlay System:
- Add internal/config/overlay.go with generic ConfigValue[T] type
- Support client > server > default priority across all driver configs
- Refactor telemetry to use ConfigValue[bool] instead of *bool
- Add comprehensive tests for overlay pattern (20+ test cases)

Feature Flag Refactoring:
- Support multiple feature flags in single HTTP request
- Change from single flag to map[string]bool storage
- Add getAllFeatureFlags() for easy flag registration
- Add getFeatureFlag() generic method for any flag
- Keep isTelemetryEnabled() API backward compatible
- Add ADDING_FEATURE_FLAGS.md guide for extending system

Documentation Updates:
- Document config overlay pattern in DESIGN.md
- Clarify synchronous fetch behavior (10s timeout, RetryableClient)
- Document blocking scenarios and thundering herd protection
- Add rationale for synchronous vs async approach

Benefits:
- Easy to add new flags (3 lines of code)
- Single HTTP request fetches all flags (efficient)
- Reusable ConfigValue[T] pattern for all driver configs
- All 89 telemetry tests passing
- Run gofmt -s on all telemetry files
- Add explicit error ignore for json.Encoder.Encode in tests
- Add comment to empty select case to fix SA9003 staticcheck warning
- Add explanatory comment to empty default case in connection.go:340
- This fixes the staticcheck SA9003 warning about empty branch
- The select statement uses default as non-blocking context check

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Changed from /api/2.0/telemetry-ext to /telemetry-ext
- Verified against JDBC driver PathConstants.java:12
- TELEMETRY_PATH = "/telemetry-ext"
- First connection now creates cache context and fetches from server
- Ensures telemetry respects actual server flag on initial connection
- Subsequent connections remain non-blocking with cached value
- Double-checked locking pattern to prevent race conditions
- Fix TestExport_Success: correct expected endpoint path to /telemetry-ext
- Fix TestFeatureFlagCache_IsTelemetryEnabled_NoContext: provide valid httpClient to avoid nil pointer dereference

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
Add explicit statements to empty if blocks to satisfy staticcheck SA9003.
These branches are intentionally empty as they swallow errors/panics
to ensure telemetry never impacts driver operation.

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
…gation

This commit implements Phase 6 (metric collection and aggregation) for the
telemetry system.

Phase 6: Metric Collection & Aggregation
- Implement error classification (errors.go)
  - isTerminalError() for identifying non-retryable errors
  - classifyError() for categorizing errors for telemetry
  - HTTP error handling utilities

- Implement telemetry interceptor (interceptor.go)
  - beforeExecute() / afterExecute() hooks for statement execution
  - Context-based metric tracking with metricContext
  - Latency measurement and tag collection
  - Connection event recording
  - Error swallowing with panic recovery

- Implement metrics aggregator (aggregator.go)
  - Statement-level metric aggregation
  - Batch size and flush interval logic
  - Background flush goroutine with ticker
  - Thread-safe metric recording with mutex protection
  - Immediate flush for connection and terminal errors
  - Aggregated counts (chunks, bytes, polls)

- Update telemetryClient (client.go)
  - Wire up aggregator with exporter
  - Automatic aggregator start in constructor
  - Graceful shutdown with 5s timeout
  - getInterceptor() for per-connection interceptors

Architecture:
- Each connection gets its own interceptor instance
- All interceptors share the same aggregator (per host)
- Aggregator batches metrics and flushes periodically
- Exporter sends batched metrics to Databricks
- Circuit breaker protects against endpoint failures

Testing:
- All 70+ existing tests continue to pass
- Compilation verified, no breaking changes

Note: Phase 7 (driver integration) will be completed separately to allow
careful review and testing of hooks in connection.go and statement.go.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit implements Phase 7 (driver integration) for the telemetry system,
completing the full telemetry pipeline from driver operations to export.

Phase 7: Driver Integration
- Add telemetry configuration to UserConfig
  - EnableTelemetry: User opt-in flag (respects server feature flags)
  - ForceEnableTelemetry: Force enable flag (bypasses server checks)
  - DSN parameter parsing in ParseDSN()
  - DeepCopy support for telemetry fields

- Add telemetry support to connection
  - Add telemetry field to conn struct (*telemetry.Interceptor)
  - Initialize telemetry in connector.Connect()
  - Release telemetry resources in conn.Close()
  - Graceful shutdown with pending metric flush

- Export telemetry types for driver use
  - Export Interceptor type (was interceptor)
  - Export GetInterceptor() method (was getInterceptor)
  - Export Close() method (was close)

- Create driver integration helper (driver_integration.go)
  - InitializeForConnection(): One-stop initialization
  - ReleaseForConnection(): Resource cleanup
  - Encapsulates feature flag checks and client management
  - Reference counting for per-host resources

Integration Flow:
1. User sets enableTelemetry=true or forceEnableTelemetry=true in DSN
2. connector.Connect() calls telemetry.InitializeForConnection()
3. Telemetry checks feature flags and returns Interceptor if enabled
4. Connection uses Interceptor for metric collection (Phase 8)
5. conn.Close() releases telemetry resources

Architecture:
- Per-connection: Interceptor instance
- Per-host (shared): telemetryClient, aggregator, exporter
- Global (singleton): clientManager, featureFlagCache, circuitBreakerManager

Opt-In Priority (5 levels):
1. forceEnableTelemetry=true - Always enabled (testing/internal)
2. enableTelemetry=false - Always disabled (explicit opt-out)
3. enableTelemetry=true + server flag - User opt-in with server control
4. Server flag only - Default Databricks-controlled behavior
5. Default - Disabled (fail-safe)

Testing:
- All 70+ telemetry tests passing
- No breaking changes to existing driver tests
- Compilation verified across all packages
- Graceful handling when telemetry disabled

Note: Statement hooks (beforeExecute/afterExecute) will be added in follow-up
for actual metric collection during query execution.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Adapt InitializeForConnection to use ConfigValue[bool] pattern from base branch.
- Change enableTelemetry parameter to *bool (nil = unset, check server)
- Use config.NewConfigValue() to set explicit values
- Update connector.go call site to pass pointer
- Remove ForceEnableTelemetry (now handled by ConfigValue override)

Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@samikshya-db samikshya-db force-pushed the stack/PECOBLR-1381-1382-telemetry-phase6-7 branch from db32fa3 to f388244 Compare February 5, 2026 07:33
Base automatically changed from stack/PECOBLR-1143-telemetry-phase4-5 to main February 9, 2026 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant