feat(workloads): add performance metrics collection for DR drill testing#4

Draft
tvaron3 wants to merge 15 commits into main from feat/dr-drill-workload-fixes

tvaron3 commented Apr 10, 2026

Summary

Add a performance metrics library to the Python Cosmos DB workloads that reports PerfResult documents to a results Cosmos DB account, matching the Rust perf tool schema exactly. Both SDKs write to the same ADX → Grafana pipeline.

New Files

  • perf_stats.py — Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots
  • perf_config.py — All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
  • perf_reporter.py — Background daemon thread that drains Stats every 5 minutes and upserts PerfResult documents via sync CosmosClient with AAD auth (DefaultAzureCredential)

Modified Files

  • workload_configs.py — All configs now driven by environment variables with sensible defaults
  • workload_utils.py — Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction). Only successful operations record latency.
  • All *_workload.py files — Integrated Stats + PerfReporter with try/finally lifecycle management
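
A timed wrapper of the kind described for workload_utils.py might look like the following sketch. It records latency only on success and, on failure, records the error with whatever `status_code` and `sub_status` the exception carries. The names `timed_operation` and `RecordingStats` are hypothetical, the wrapper is synchronous for brevity, and re-raising after recording is an assumption. A generic `Exception` with `getattr` stands in for `CosmosHttpResponseError` so the example runs without the SDK installed.

```python
import time


class RecordingStats:
    """Hypothetical minimal recorder used only to make this sketch runnable."""

    def __init__(self):
        self.latencies = []  # (operation, latency_ms)
        self.errors = []     # (operation, status_code, sub_status)

    def record_latency(self, operation, latency_ms):
        self.latencies.append((operation, latency_ms))

    def record_error(self, operation, status_code, sub_status):
        self.errors.append((operation, status_code, sub_status))


def timed_operation(stats, operation, func, *args, **kwargs):
    """Run func, recording latency on success or an error entry on failure."""
    start_ns = time.perf_counter_ns()
    try:
        result = func(*args, **kwargs)
    except Exception as exc:
        # Attributes are read defensively: non-Cosmos exceptions yield None.
        stats.record_error(operation,
                           getattr(exc, "status_code", None),
                           getattr(exc, "sub_status", None))
        raise  # assumption: the caller still sees the failure
    # Reached only on success, so failed calls never enter the percentiles.
    stats.record_latency(operation, (time.perf_counter_ns() - start_ns) / 1_000_000)
    return result
```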

PerfResult Document Schema

Matches the Rust perf tool schema exactly:

  • Identity: id, partition_key, workload_id, commit_sha, hostname, TIMESTAMP
  • Metrics: operation, count, errors, min_ms, max_ms, mean_ms, p50_ms, p90_ms, p99_ms
  • System: cpu_percent, memory_bytes, system_cpu_percent, system_total_memory_bytes, system_used_memory_bytes
  • Cross-SDK: sdk_language="python", sdk_version from azure.cosmos.__version__
  • Config: config_concurrency, config_application_region, config_excluded_regions, config_ppcb_enabled
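
To make the field groups above concrete, here is a minimal sketch of assembling one PerfResult document as a plain dict. Only the field names come from the schema listed above. The function name `build_perf_result`, the use of `workload_id` as the partition key value, the UUID `id`, and the ISO 8601 `TIMESTAMP` format are assumptions for illustration.

```python
import socket
import uuid
from datetime import datetime, timezone

METRIC_FIELDS = ("count", "errors", "min_ms", "max_ms", "mean_ms",
                 "p50_ms", "p90_ms", "p99_ms")
SYSTEM_FIELDS = ("cpu_percent", "memory_bytes", "system_cpu_percent",
                 "system_total_memory_bytes", "system_used_memory_bytes")
CONFIG_FIELDS = ("config_concurrency", "config_application_region",
                 "config_excluded_regions", "config_ppcb_enabled")


def build_perf_result(operation, summary, system, config, *,
                      workload_id, commit_sha, sdk_version):
    """Assemble one document; raises KeyError if an input lacks a schema field."""
    return {
        # Identity
        "id": str(uuid.uuid4()),
        "partition_key": workload_id,  # assumption: partitioned by workload
        "workload_id": workload_id,
        "commit_sha": commit_sha,
        "hostname": socket.gethostname(),
        "TIMESTAMP": datetime.now(timezone.utc).isoformat(),  # assumed format
        # Metrics
        "operation": operation,
        **{name: summary[name] for name in METRIC_FIELDS},
        # System
        **{name: system[name] for name in SYSTEM_FIELDS},
        # Cross-SDK
        "sdk_language": "python",
        "sdk_version": sdk_version,
        # Config
        **{name: config[name] for name in CONFIG_FIELDS},
    }
```

Copying fields by name from explicit tuples means a missing field fails loudly at build time instead of producing a document that silently drifts from the shared schema.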

Key Design Decisions

  • Sorted-list percentiles (no hdrhistogram native dependency needed)
  • psutil for CPU/memory with /proc fallback on Linux
  • Cached psutil.Process() instance for accurate CPU readings
  • CosmosClient properly stored and closed to avoid resource leaks
  • PPCB disabled by default
  • All reporter errors caught and logged as warnings — never crash the workload
  • Error latencies excluded from success percentiles to avoid metric pollution
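
The "never crash the workload" decision can be sketched as a daemon thread whose reporting step swallows and logs every exception. The class below is a simplified, hypothetical stand-in for perf_reporter.py: `drain` and `upsert` are injected callables rather than a real Stats object and CosmosClient, and the final flush in `stop()` is an assumption.

```python
import logging
import threading

logger = logging.getLogger("perf_reporter_sketch")


class PerfReporter:
    """Hypothetical reporter: drains documents periodically and upserts them."""

    def __init__(self, drain, upsert, interval_s=300.0):
        self._drain = drain      # callable returning an iterable of documents
        self._upsert = upsert    # callable that writes one document
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="perf-reporter", daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        """Stop the thread (must have been started), then flush once more."""
        self._stop_event.set()
        self._thread.join(timeout=5)
        self._report_once()  # assumption: report the last partial interval

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop_event.wait(self._interval_s):
            self._report_once()

    def _report_once(self):
        try:
            for doc in self._drain():
                self._upsert(doc)
        except Exception as exc:
            # Reporting is best effort; the workload must keep running.
            logger.warning("perf report failed: %s", exc)
```

A workload would typically call `start()` before its main loop and `stop()` in a `finally` block, matching the try/finally lifecycle mentioned above.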

tvaron3 and others added 2 commits April 9, 2026 19:29
…hroughput

- Uncomment concurrent upsert/read/query calls
- Remove manual timing counters and log_request_counts
- Set THROUGHPUT to 1000000 in workload_configs.py
- Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a
Cosmos DB results account, matching the Rust perf tool schema exactly so
both SDKs feed the same ADX → Grafana pipeline.

New files:
- perf_stats.py: Thread-safe latency histogram with sorted-list percentile
  calculation and atomic drain_all() for consistent summary+error snapshots
- perf_config.py: All config from environment variables (RESULTS_COSMOS_URI,
  PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults)
- perf_reporter.py: Background daemon thread that drains Stats every 5 min
  and upserts PerfResult documents via sync CosmosClient with AAD auth

Modified files:
- workload_configs.py: All configs now driven by environment variables
- workload_utils.py: Added timed operation wrappers with error tracking
  (CosmosHttpResponseError status_code/sub_status extraction), only
  successful operations record latency to avoid polluting percentiles
- All *_workload.py files: Integrated Stats + PerfReporter with try/finally
  lifecycle management

Key design decisions:
- Sorted-list percentiles (no hdrhistogram native dependency)
- psutil for CPU/memory with /proc fallback on Linux
- Cached psutil.Process() instance for accurate CPU readings
- CosmosClient stored and closed properly to avoid resource leaks
- sdk_language='python', sdk_version from azure.cosmos.__version__
- PPCB disabled by default
- All reporter errors caught and logged as warnings (never crash workload)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tvaron3 and others added 13 commits April 9, 2026 23:26
psutil is now a hard import (not optional). Removed all /proc/meminfo
and /proc/self/status fallback parsing — if psutil is not installed,
the import will fail immediately rather than silently degrading.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files
- WORKLOAD_OPERATIONS env var controls which ops run (read,write,query)
- WORKLOAD_USE_PROXY env var enables Envoy proxy routing
- WORKLOAD_USE_SYNC env var enables sync client
- Validate operation names at import time with clear error
- Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query)
- Bounded memory usage (~40KB per histogram instead of unbounded list growth)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
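
The import-time validation of WORKLOAD_OPERATIONS mentioned in this commit could look roughly like the sketch below. The helper name `parse_operations`, the default of all three operations, and the exact error wording are assumptions. Only the variable name and the allowed values read, write, and query come from the commit message.

```python
import os

VALID_OPERATIONS = ("read", "write", "query")


def parse_operations(raw):
    """Split a comma-separated list and reject unknown or empty selections."""
    operations = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = [op for op in operations if op not in VALID_OPERATIONS]
    if unknown or not operations:
        raise ValueError(
            f"WORKLOAD_OPERATIONS={raw!r} is invalid; "
            f"expected a comma-separated subset of {', '.join(VALID_OPERATIONS)}"
        )
    return operations


# Evaluated at import time, so a typo fails before any client is created.
# The default value here is an assumption for this sketch.
OPERATIONS = parse_operations(
    os.environ.get("WORKLOAD_OPERATIONS", "read,write,query"))
```
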
…rkload.py

Removed: r_workload.py, w_workload.py, r_proxy_workload.py,
w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py,
r_w_q_sync_workload.py

All replaced by workload.py with WORKLOAD_OPERATIONS and
WORKLOAD_USE_PROXY env vars.

Kept: r_w_q_with_incorrect_client_workload.py (special test case)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var:
WORKLOAD_SKIP_CLOSE=true creates the client without a context manager,
simulating applications that don't properly close the Cosmos client.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
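
A rough sketch of how a WORKLOAD_SKIP_CLOSE switch can choose between the two client lifecycles. `FakeClient`, `env_flag`, and `run_workload` are hypothetical names, the accepted truthy strings are an assumption, and the example is synchronous for brevity.

```python
import os


class FakeClient:
    """Hypothetical stand-in for a Cosmos client; only tracks close()."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions

    def close(self):
        self.closed = True


def env_flag(name, default=False, environ=None):
    """Read a boolean environment variable; unset means the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes")


def run_workload(client_factory, work, skip_close):
    """Run work(client), closing the client unless skip_close is set."""
    if skip_close:
        client = client_factory()  # deliberately never closed
        work(client)
        return client
    with client_factory() as client:
        work(client)
    return client
```
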
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000
for nanosecond precision without floating-point multiplication artifacts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
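
The timing change in this commit amounts to subtracting two integer nanosecond readings and converting to milliseconds with a single division. A minimal sketch, with `elapsed_ms` as a hypothetical helper name:

```python
import time


def elapsed_ms(start_ns, end_ns):
    """Convert two perf_counter_ns() readings to elapsed milliseconds."""
    # The subtraction is exact integer arithmetic; only the final division
    # produces a float.
    return (end_ns - start_ns) / 1_000_000


start_ns = time.perf_counter_ns()
sum(range(10_000))  # placeholder for the operation being timed
duration_ms = elapsed_ms(start_ns, time.perf_counter_ns())
```
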
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo,
not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py)
stays here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram

The pip package is 'hdrhistogram' but the Python module is 'hdrh'.
Import changed from 'from hdrhistogram import HdrHistogram' to
'from hdrh.histogram import HdrHistogram'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot
so it's visible in the Grafana dashboard and queryable from Kusto.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary

Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of
root .vscode/cspell.json to keep changes within cosmos folder scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>