Skip to content

[FLINK-39376] [checkpoint]Add IP Address of Subtask Running Checkpoint to Flink WebU#27876

Open
wangxiaojing wants to merge 1 commit intoapache:masterfrom
wangxiaojing:show-TaskManager-IP-address-in-checkpoint-subtask-statistics
Open

[FLINK-39376] [checkpoint]Add IP Address of Subtask Running Checkpoint to Flink WebU#27876
wangxiaojing wants to merge 1 commit intoapache:masterfrom
wangxiaojing:show-TaskManager-IP-address-in-checkpoint-subtask-statistics

Conversation

@wangxiaojing
Copy link
Copy Markdown

What is the purpose of the change

This pull request adds the TaskManager IP address to the per-subtask checkpoint statistics displayed in the Flink Web UI and exposed via the REST API.

When diagnosing slow or failing checkpoints, operators need to identify which TaskManager host a particular subtask was running on. Without this information, correlating checkpoint latency with specific nodes (e.g., machines with disk I/O bottlenecks, network issues, or GC
pressure) requires cross-referencing subtask indices with the TaskManager assignment through a separate UI path. This change surfaces the IP address directly in the subtask checkpoint statistics table, reducing time-to-diagnose for checkpoint-related incidents.

Brief change log

  • SubtaskStateStats: Added ip field (nullable String) with getter to carry the TaskManager IP address per subtask. Updated serialVersionUID to reflect the serialization format change.
  • PendingCheckpoint: When a subtask acknowledges a checkpoint, extract the TaskManager's IP via TaskManagerLocation.getAddress().getHostAddress() and store it in the resulting SubtaskStateStats.
  • DefaultCheckpointStatsTracker: Pass null for ip on the internal stats-only code path where no execution context is available.
  • SubtaskCheckpointStatistics.CompletedSubtaskCheckpointStatistics: Added optional ip JSON field to the REST API response. The field is backwards compatible — clients that do not recognise it will ignore it.
  • TaskCheckpointStatisticDetailsHandler: Propagate SubtaskStateStats.getIp() into the REST response object.
  • Web UI (job-checkpoints-subtask component): Added a sortable "IP Address" column to the subtask checkpoint statistics table. The sort function correctly normalises IPv4 segments for lexicographic ordering. Null/missing values are displayed as -.

Verifying this change

This change added tests and can be verified as follows:

  • SubtaskStateStatsTest: Extended existing getter tests to assert that getIp() returns the value passed at construction time, and that the value survives Java serialization round-trips.
  • PendingCheckpointTest#testAcknowledgeTaskCapturesTaskManagerIp: New test. Creates a mock ExecutionVertex backed by a real TaskManagerLocation (), calls PendingCheckpoint.acknowledgeTask(), and asserts that
    PendingCheckpointStats.getLatestAcknowledgedSubtaskStats().getIp() equals "". This directly validates the extraction path through location.getAddress().getHostAddress().
  • TaskCheckpointStatisticsWithSubtaskDetailsTest: Updated JSON marshalling/unmarshalling round-trip test to include a realistic IP value , validating that the ip field is correctly serialized and deserialized via Jackson.
  • DefaultCheckpointStatsTrackerTest, PendingCheckpointStatsTest, TaskStateStatsTest, CompletedCheckpointTest: Updated all SubtaskStateStats construction call sites to pass the new ip parameter with a consistent null sentinel.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @public(Evolving): no
  • The serializers: yes — SubtaskStateStats is Serializable; a new ip field was added . Old serialized forms of SubtaskStateStats (checkpoint statistics history) are not compatible with the new class.
    This affects checkpoint stats display after a rolling upgrade but does not affect checkpoint data correctness.
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes — the change touches PendingCheckpoint.acknowledgeTask(), which is on the checkpoint acknowledgement path in the JobManager. The change is read-only
    with respect to checkpoint correctness (it only reads TaskManagerLocation metadata already available on the ExecutionVertex) and adds no synchronisation overhead.
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not documented (the IP address column is self-explanatory in the Web UI; no changes to the public REST API contract beyond adding an optional backwards-compatible field)

@flinkbot
Copy link
Copy Markdown
Collaborator

flinkbot commented Apr 1, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants