feat(workloads): add performance metrics collection for DR drill testing#4
Draft
tvaron3 wants to merge 15 commits into
Draft
feat(workloads): add performance metrics collection for DR drill testing#4tvaron3 wants to merge 15 commits into
tvaron3 wants to merge 15 commits into
Conversation
…hroughput - Uncomment concurrent upsert/read/query calls - Remove manual timing counters and log_request_counts - Set THROUGHPUT to 1000000 in workload_configs.py - Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline. New files: - perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots - perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults) - perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth Modified files: - workload_configs.py: All configs now driven by environment variables - workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction), only successful operations record latency to avoid polluting percentiles - All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management Key design decisions: - Sorted-list percentiles (no hdrhistogram native dependency) - psutil for CPU/memory with /proc fallback on Linux - Cached psutil.Process() instance for accurate CPU readings - CosmosClient stored and closed properly to avoid resource leaks - sdk_language='python', sdk_version from azure.cosmos.__version__ - PPCB disabled by default - All reporter errors caught and logged as warnings (never crash workload) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files - WORKLOAD_OPERATIONS env var controls which ops run (read,write,query) - WORKLOAD_USE_PROXY env var enables Envoy proxy routing - WORKLOAD_USE_SYNC env var enables sync client - Validate operation names at import time with clear error - Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query) - Fixed memory usage (~40KB per histogram vs unbounded list growth) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rkload.py Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars. Kept: r_w_q_with_incorrect_client_workload.py (special test case) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
06a9932 to
a138997
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add a performance metrics library to the Python Cosmos DB workloads that reports
PerfResultdocuments to a results Cosmos DB account, matching the Rust perf tool schema exactly. Both SDKs write to the same ADX → Grafana pipeline.New Files
perf_stats.py— Thread-safe latency histogram with sorted-list percentile calculation and atomicdrain_all()for consistent summary+error snapshotsperf_config.py— All config from environment variables (RESULTS_COSMOS_URI,PERF_REPORT_INTERVAL=300s,perfdb/perfresultsdefaults)perf_reporter.py— Background daemon thread that drains Stats every 5 minutes and upserts PerfResult documents via sync CosmosClient with AAD auth (DefaultAzureCredential)Modified Files
workload_configs.py— All configs now driven by environment variables with sensible defaultsworkload_utils.py— Added timed operation wrappers with error tracking (CosmosHttpResponseErrorstatus_code/sub_status extraction). Only successful operations record latency.*_workload.pyfiles — Integrated Stats + PerfReporter with try/finally lifecycle managementPerfResult Document Schema
Matches Rust exactly:
id,partition_key,workload_id,commit_sha,hostname,TIMESTAMPoperation,count,errors,min_ms,max_ms,mean_ms,p50_ms,p90_ms,p99_mscpu_percent,memory_bytes,system_cpu_percent,system_total_memory_bytes,system_used_memory_bytessdk_language="python",sdk_versionfromazure.cosmos.__version__config_concurrency,config_application_region,config_excluded_regions,config_ppcb_enabledKey Design Decisions
psutilfor CPU/memory with/procfallback on Linuxpsutil.Process()instance for accurate CPU readings