-
Notifications
You must be signed in to change notification settings - Fork 658
Add metrics for VM states #2649
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add metrics for VM states #2649
Conversation
- Update Heartbeat.to_hash to include process_length attribute when present - Extract process_length from heartbeat payload in Riemann plugin - Attach process_length to each Riemann event (if present) - Support both symbol and string keys for payload compatibility - Add unit tests verifying process_length inclusion/omission in events - All unit tests pass (Riemann: 5/5, Heartbeat: 10/10)
|
Cross linking agent changes cloudfoundry/bosh-agent#390 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds new metrics for tracking VM states in BOSH, including unhealthy, failing, stopped, and unknown instances. The implementation introduces five new Prometheus metrics and corresponding API endpoints to expose these state counts per deployment, with agents reporting the number of running processes to detect unhealthy states.
Key changes:
- Adds
number_of_processesandjob_statetracking to the Agent class - Introduces five new metrics methods in InstanceManager for counting agents by state
- Creates corresponding API endpoints and Prometheus gauge metrics for each state
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| src/bosh-monitor/lib/bosh/monitor/agent.rb | Adds job_state and number_of_processes attributes to track agent health state |
| src/bosh-monitor/lib/bosh/monitor/instance_manager.rb | Implements five new metric calculation methods (unhealthy, failing, stopped, unknown, total_available agents) |
| src/bosh-monitor/lib/bosh/monitor/api_controller.rb | Adds five new GET endpoints to expose the VM state metrics |
| src/bosh-monitor/lib/bosh/monitor/events/heartbeat.rb | Updates heartbeat to include number_of_processes in the serialized hash |
| src/bosh-director/lib/bosh/director/metrics_collector.rb | Registers five new Prometheus gauges and fetches/populates metrics from health monitor endpoints |
| src/bosh-monitor/spec/unit/bosh/monitor/instance_manager_spec.rb | Adds comprehensive test coverage for all five new metric methods |
| src/bosh-monitor/spec/unit/bosh/monitor/api_controller_spec.rb | Adds tests for all five new API endpoints including 503 responses |
| src/bosh-monitor/spec/unit/bosh/monitor/agent_spec.rb | Tests that update_instance preserves job_state and number_of_processes |
| src/bosh-director/spec/unit/bosh/director/metrics_collector_spec.rb | Tests Prometheus metric registration and population for all new VM state metrics |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def failing_instances | ||
| agents_hash = {} | ||
| @deployment_name_to_deployments.each do |name, deployment| | ||
| agents_hash[name] = deployment.agents.count { |agent| agent.job_state == 'failing' } |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent quote style: all other methods in this file use double quotes for strings, but this method uses single quotes. Consider using double quotes for consistency.
| end | ||
| end | ||
|
|
||
| get "/unhealthy_agents" do |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent quote style in the path parameter: uses double quotes while the previous unresponsive_agents endpoint uses single quotes. Consider using single quotes consistently as seen throughout the rest of the file.
| get "/unhealthy_agents" do | |
| get '/unhealthy_agents' do |
| # Fetch and populate total_available_agents metrics | ||
| response_total = Net::HTTP.get_response('127.0.0.1', '/total_available_agents', @config.health_monitor_port) | ||
| if response_total.is_a?(Net::HTTPSuccess) | ||
| total_agent_counts = JSON.parse(response_total.body) rescue nil |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rescue nil pattern silently swallows all exceptions during JSON parsing. If the parsing fails for unexpected reasons (not just malformed JSON), this could hide bugs. Consider using a more specific rescue clause like rescue JSON::ParserError or at least logging the error when it occurs.
| total_agent_counts = JSON.parse(response_total.body) rescue nil | |
| total_agent_counts = begin | |
| JSON.parse(response_total.body) | |
| rescue JSON::ParserError => e | |
| @logger.warn("Failed to parse total_available_agents response as JSON: #{e.message}") if @logger | |
| nil | |
| end |
| # Fetch and populate stopped_instances metrics | ||
| response_stopped = Net::HTTP.get_response('127.0.0.1', '/stopped_instances', @config.health_monitor_port) | ||
| if response_stopped.is_a?(Net::HTTPSuccess) | ||
| stopped_counts = JSON.parse(response_stopped.body) rescue nil |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rescue nil pattern silently swallows all exceptions during JSON parsing. Consider using a more specific rescue clause like rescue JSON::ParserError for consistency with error handling best practices.
| # Fetch and populate unknown_instances metrics | ||
| response_unknown = Net::HTTP.get_response('127.0.0.1', '/unknown_instances', @config.health_monitor_port) | ||
| if response_unknown.is_a?(Net::HTTPSuccess) | ||
| unknown_counts = JSON.parse(response_unknown.body) rescue nil |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rescue nil pattern silently swallows all exceptions during JSON parsing. Consider using a more specific rescue clause like rescue JSON::ParserError for consistency with error handling best practices.
| if response_stopped.is_a?(Net::HTTPSuccess) | ||
| stopped_counts = JSON.parse(response_stopped.body) rescue nil | ||
| if stopped_counts.is_a?(Hash) | ||
| existing_stopped_deployment_names = @stopped_instances.values.map { |key, _| key[:name] } | ||
|
|
||
| stopped_counts.each do |deployment, count| | ||
| @stopped_instances.set(count, labels: { name: deployment }) | ||
| end | ||
|
|
||
| removed_stopped_deployments = existing_stopped_deployment_names - stopped_counts.keys | ||
| removed_stopped_deployments.each do |deployment| | ||
| @stopped_instances.set(0, labels: { name: deployment }) | ||
| end | ||
| end | ||
| end | ||
|
|
||
| # Fetch and populate unknown_instances metrics | ||
| response_unknown = Net::HTTP.get_response('127.0.0.1', '/unknown_instances', @config.health_monitor_port) | ||
| if response_unknown.is_a?(Net::HTTPSuccess) | ||
| unknown_counts = JSON.parse(response_unknown.body) rescue nil | ||
| if unknown_counts.is_a?(Hash) | ||
| existing_unknown_deployment_names = @unknown_instances.values.map { |key, _| key[:name] } | ||
|
|
||
| unknown_counts.each do |deployment, count| | ||
| @unknown_instances.set(count, labels: { name: deployment }) | ||
| end | ||
|
|
||
| removed_unknown_deployments = existing_unknown_deployment_names - unknown_counts.keys | ||
| removed_unknown_deployments.each do |deployment| | ||
| @unknown_instances.set(0, labels: { name: deployment }) | ||
| end | ||
| end |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All five new metric collection blocks follow a nearly identical pattern with only minor differences (endpoint name, metric variable, method call). This represents significant code duplication. Consider extracting this into a helper method that takes the endpoint name and metric variable as parameters to improve maintainability.
| if response_stopped.is_a?(Net::HTTPSuccess) | |
| stopped_counts = JSON.parse(response_stopped.body) rescue nil | |
| if stopped_counts.is_a?(Hash) | |
| existing_stopped_deployment_names = @stopped_instances.values.map { |key, _| key[:name] } | |
| stopped_counts.each do |deployment, count| | |
| @stopped_instances.set(count, labels: { name: deployment }) | |
| end | |
| removed_stopped_deployments = existing_stopped_deployment_names - stopped_counts.keys | |
| removed_stopped_deployments.each do |deployment| | |
| @stopped_instances.set(0, labels: { name: deployment }) | |
| end | |
| end | |
| end | |
| # Fetch and populate unknown_instances metrics | |
| response_unknown = Net::HTTP.get_response('127.0.0.1', '/unknown_instances', @config.health_monitor_port) | |
| if response_unknown.is_a?(Net::HTTPSuccess) | |
| unknown_counts = JSON.parse(response_unknown.body) rescue nil | |
| if unknown_counts.is_a?(Hash) | |
| existing_unknown_deployment_names = @unknown_instances.values.map { |key, _| key[:name] } | |
| unknown_counts.each do |deployment, count| | |
| @unknown_instances.set(count, labels: { name: deployment }) | |
| end | |
| removed_unknown_deployments = existing_unknown_deployment_names - unknown_counts.keys | |
| removed_unknown_deployments.each do |deployment| | |
| @unknown_instances.set(0, labels: { name: deployment }) | |
| end | |
| end | |
| populate_instance_metrics('/stopped_instances', @stopped_instances) | |
| populate_instance_metrics('/unknown_instances', @unknown_instances) | |
| end | |
| def populate_instance_metrics(endpoint, metric) | |
| response = Net::HTTP.get_response('127.0.0.1', endpoint, @config.health_monitor_port) | |
| return unless response.is_a?(Net::HTTPSuccess) | |
| counts = JSON.parse(response.body) rescue nil | |
| return unless counts.is_a?(Hash) | |
| existing_deployment_names = metric.values.map { |key, _| key[:name] } | |
| counts.each do |deployment, count| | |
| metric.set(count, labels: { name: deployment }) | |
| end | |
| removed_deployments = existing_deployment_names - counts.keys | |
| removed_deployments.each do |deployment| | |
| metric.set(0, labels: { name: deployment }) |
| end | ||
|
|
||
| # Fetch and populate unhealthy_agents metrics | ||
| response_unhealthy = Net::HTTP.get_response("127.0.0.1", "/unhealthy_agents", @config.health_monitor_port) |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent quote style: this line uses double quotes while line 188 below uses single quotes. Consider using a consistent quote style throughout the method.
| response_unhealthy = Net::HTTP.get_response("127.0.0.1", "/unhealthy_agents", @config.health_monitor_port) | |
| response_unhealthy = Net::HTTP.get_response('127.0.0.1', '/unhealthy_agents', @config.health_monitor_port) |
| if response_total.is_a?(Net::HTTPSuccess) | ||
| total_agent_counts = JSON.parse(response_total.body) rescue nil | ||
| if total_agent_counts.is_a?(Hash) | ||
| existing_total_deployment_names = @total_available_agents.values.map { |key, _| key[:name] } | ||
|
|
||
| total_agent_counts.each do |deployment, count| | ||
| @total_available_agents.set(count, labels: { name: deployment }) | ||
| end | ||
|
|
||
| removed_total_deployments = existing_total_deployment_names - total_agent_counts.keys | ||
| removed_total_deployments.each do |deployment| | ||
| @total_available_agents.set(0, labels: { name: deployment }) | ||
| end | ||
| end | ||
| end | ||
|
|
||
| # Fetch and populate failing_instances metrics | ||
| response_failing = Net::HTTP.get_response('127.0.0.1', '/failing_instances', @config.health_monitor_port) | ||
| if response_failing.is_a?(Net::HTTPSuccess) | ||
| failing_counts = JSON.parse(response_failing.body) rescue nil | ||
| if failing_counts.is_a?(Hash) | ||
| existing_failing_deployment_names = @failing_instances.values.map { |key, _| key[:name] } | ||
|
|
||
| failing_counts.each do |deployment, count| | ||
| @failing_instances.set(count, labels: { name: deployment }) | ||
| end | ||
|
|
||
| removed_failing_deployments = existing_failing_deployment_names - failing_counts.keys | ||
| removed_failing_deployments.each do |deployment| | ||
| @failing_instances.set(0, labels: { name: deployment }) | ||
| end | ||
| end | ||
| end | ||
|
|
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling pattern differs from the unresponsive_agents block above. The first block uses early returns after each validation check, while the new metrics blocks wrap all logic inside nested if statements. Consider using the same early-return pattern for consistency and better readability.
| if response_total.is_a?(Net::HTTPSuccess) | |
| total_agent_counts = JSON.parse(response_total.body) rescue nil | |
| if total_agent_counts.is_a?(Hash) | |
| existing_total_deployment_names = @total_available_agents.values.map { |key, _| key[:name] } | |
| total_agent_counts.each do |deployment, count| | |
| @total_available_agents.set(count, labels: { name: deployment }) | |
| end | |
| removed_total_deployments = existing_total_deployment_names - total_agent_counts.keys | |
| removed_total_deployments.each do |deployment| | |
| @total_available_agents.set(0, labels: { name: deployment }) | |
| end | |
| end | |
| end | |
| # Fetch and populate failing_instances metrics | |
| response_failing = Net::HTTP.get_response('127.0.0.1', '/failing_instances', @config.health_monitor_port) | |
| if response_failing.is_a?(Net::HTTPSuccess) | |
| failing_counts = JSON.parse(response_failing.body) rescue nil | |
| if failing_counts.is_a?(Hash) | |
| existing_failing_deployment_names = @failing_instances.values.map { |key, _| key[:name] } | |
| failing_counts.each do |deployment, count| | |
| @failing_instances.set(count, labels: { name: deployment }) | |
| end | |
| removed_failing_deployments = existing_failing_deployment_names - failing_counts.keys | |
| removed_failing_deployments.each do |deployment| | |
| @failing_instances.set(0, labels: { name: deployment }) | |
| end | |
| end | |
| end | |
| return unless response_total.is_a?(Net::HTTPSuccess) | |
| total_agent_counts = JSON.parse(response_total.body) rescue nil | |
| return unless total_agent_counts.is_a?(Hash) | |
| existing_total_deployment_names = @total_available_agents.values.map { |key, _| key[:name] } | |
| total_agent_counts.each do |deployment, count| | |
| @total_available_agents.set(count, labels: { name: deployment }) | |
| end | |
| removed_total_deployments = existing_total_deployment_names - total_agent_counts.keys | |
| removed_total_deployments.each do |deployment| | |
| @total_available_agents.set(0, labels: { name: deployment }) | |
| end | |
| # Fetch and populate failing_instances metrics | |
| response_failing = Net::HTTP.get_response('127.0.0.1', '/failing_instances', @config.health_monitor_port) | |
| return unless response_failing.is_a?(Net::HTTPSuccess) | |
| failing_counts = JSON.parse(response_failing.body) rescue nil | |
| return unless failing_counts.is_a?(Hash) | |
| existing_failing_deployment_names = @failing_instances.values.map { |key, _| key[:name] } | |
| failing_counts.each do |deployment, count| | |
| @failing_instances.set(count, labels: { name: deployment }) | |
| end | |
| removed_failing_deployments = existing_failing_deployment_names - failing_counts.keys | |
| removed_failing_deployments.each do |deployment| | |
| @failing_instances.set(0, labels: { name: deployment }) | |
| end |
| # Fetch and populate failing_instances metrics | ||
| response_failing = Net::HTTP.get_response('127.0.0.1', '/failing_instances', @config.health_monitor_port) | ||
| if response_failing.is_a?(Net::HTTPSuccess) | ||
| failing_counts = JSON.parse(response_failing.body) rescue nil |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rescue nil pattern silently swallows all exceptions during JSON parsing. Consider using a more specific rescue clause like rescue JSON::ParserError for consistency with error handling best practices.
| stub_request(:get, /total_available_agents/) | ||
| .to_return(status: 200, body: JSON.dump("flaky_deployment" => 3, "good_deployment" => 2)) | ||
| stub_request(:get, /failing_instances/) | ||
| .to_return(status: 200, body: JSON.dump("flaky_deployment" => 1, "good_deployment" => 0)) | ||
| stub_request(:get, /stopped_instances/) | ||
| .to_return(status: 200, body: JSON.dump("flaky_deployment" => 0, "good_deployment" => 0)) | ||
| stub_request(:get, /unknown_instances/) | ||
| .to_return(status: 200, body: JSON.dump("flaky_deployment" => 0, "good_deployment" => 1)) |
Copilot
AI
Dec 18, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inconsistent indentation: these stub_request calls are indented with extra spaces (10 spaces total) while the previous stub_request calls use 6 spaces. All stub_request calls at this level should use consistent indentation.
What is this change about?
This adds new bosh metrics regarding the VM states. Initially the idea was proposed in this PR that which was reverted here due to breaking the system design.
Please provide contextual information.
Initial PR which was reverted defined the state "unhealthy" if the VM state is not "running". In this PR, each state has its own metric, and there is additional metric "unhealthy" when the VM state is running but doesn't have any processes. This requires modification in the agent heartbeat to include the number of processes. The corresponding PR for the agent heartbeat is here.
The new metrics are:
What tests have you run against this PR?
Ran the unit tests and also tested a bosh dev release on a development environment with the modified bosh-agent.
How should this change be described in bosh release notes?
Bosh now emits different states of VM as metrics, which can be used to monitor the state of the instances per deployment over time
Does this PR introduce a breaking change?
No
Tag your pair, your PM, and/or team!
@DennisAhausSAP