Contributor

Copilot AI commented Aug 14, 2025

Problem

The DCS plugin attempts to reconnect to the Subscribe stream only if the stream has already been established at some point during the plugin's lifetime. If the ecs-dcs-gateway is unavailable during plugin initialization, the plugin never attempts to reconnect and stays permanently disconnected.

This issue was observed in production on 14/04/2025 at 13:31 when the DCS gateway was unavailable during AliECS core startup.

Root Cause

In core/integration/dcs/plugin.go, the Init() method performs an initial subscription attempt (lines 290-293). If this fails, the method returns an error and the reconnection goroutine (lines 294-348) is never started:

evStream, err := p.dcsClient.Subscribe(context.Background(), in, grpc.EmptyCallOption{})
if err != nil {
    return fmt.Errorf("failed to subscribe to DCS service on %s, possible network issue or DCS gateway malfunction", viper.GetString("dcsServiceEndpoint"))
}
// Reconnection goroutine only starts if initial subscription succeeds
go func() { /* reconnection logic */ }()

Solution

This PR restructures the initialization logic to ensure the reconnection goroutine always starts, regardless of initial connection status:

  1. Always start reconnection goroutine: The goroutine now starts even if the initial subscription fails
  2. Move subscription into goroutine: Initial subscription attempts are handled within the reconnection loop
  3. Resilient initialization: Init() succeeds even when the DCS gateway is unavailable
  4. Continuous reconnection: The plugin retries the subscription every 3 seconds until it succeeds
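
The four points above can be sketched roughly as follows. This is a minimal illustration, not the code from core/integration/dcs/plugin.go: eventStream, subscribeFunc, fakeStream, runSubscribeLoop and attemptsUntilFirstEvent are hypothetical names, and a fake stream stands in for the gRPC Subscribe stream so the example runs without a gateway. The loop is called synchronously here for determinism; in a plugin it would presumably be launched with `go`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// eventStream is a hypothetical stand-in for the gRPC Subscribe stream.
type eventStream interface {
	Recv() (string, error)
}

// subscribeFunc is a hypothetical stand-in for the Subscribe call.
type subscribeFunc func(ctx context.Context) (eventStream, error)

// wait sleeps for d and reports false if ctx was cancelled in the meantime.
func wait(ctx context.Context, d time.Duration) bool {
	select {
	case <-ctx.Done():
		return false
	case <-time.After(d):
		return true
	}
}

// runSubscribeLoop handles both the initial subscription and every later
// resubscription. A nil stream means "not connected, subscribe again".
func runSubscribeLoop(ctx context.Context, subscribe subscribeFunc, retry time.Duration, handle func(string)) {
	var stream eventStream
	for ctx.Err() == nil {
		if stream == nil {
			s, err := subscribe(ctx)
			if err != nil {
				if !wait(ctx, retry) {
					return
				}
				continue
			}
			stream = s
		}
		ev, err := stream.Recv()
		if err != nil {
			stream = nil // drop the broken stream so the next pass resubscribes
			if !wait(ctx, retry) {
				return
			}
			continue
		}
		handle(ev)
	}
}

// fakeStream yields its queued events, then fails as if the connection dropped.
type fakeStream struct{ events []string }

func (f *fakeStream) Recv() (string, error) {
	if len(f.events) == 0 {
		return "", errors.New("stream closed")
	}
	ev := f.events[0]
	f.events = f.events[1:]
	return ev, nil
}

// attemptsUntilFirstEvent simulates a gateway that rejects the first
// `failures` Subscribe calls and returns how many calls were needed
// before the first event arrived.
func attemptsUntilFirstEvent(failures int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	attempts := 0
	subscribe := func(context.Context) (eventStream, error) {
		attempts++
		if attempts <= failures {
			return nil, errors.New("gateway unavailable")
		}
		return &fakeStream{events: []string{"hello"}}, nil
	}
	// Stop the loop as soon as the first event has been handled.
	runSubscribeLoop(ctx, subscribe, time.Millisecond, func(string) { cancel() })
	return attempts
}

func main() {
	fmt.Println("subscribe attempts:", attemptsUntilFirstEvent(2))
}
```

Keeping the initial attempt inside the same loop as the retries means there is only one code path for "not connected", whether the gateway was down at startup or dropped later.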

Key Changes

  • Removed early return on subscription failure that prevented goroutine startup
  • Restructured goroutine to handle both initial and subsequent subscription attempts
  • Added proper stream state management (evStream = nil on connection failures)
  • Enhanced logging for better observability of connection attempts

Testing

  • ✅ All existing DCS tests pass (no regression)
  • ✅ New test TestPluginInitWithUnavailableGateway validates the fix
  • ✅ Manual testing confirms continuous reconnection attempts
  • ✅ Core binary builds successfully
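
As an illustration of what a check like TestPluginInitWithUnavailableGateway could verify, here is a self-contained sketch. It is not the test from the PR: the plugin type, its subscribe and retry fields, and initWithUnavailableGateway are hypothetical, and the 1 ms retry interval is chosen only to keep the example fast. The check asserts two things: Init() returns no error while the gateway is down, and subscription attempts keep happening afterwards.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// plugin is a hypothetical, stripped-down stand-in for the DCS plugin.
type plugin struct {
	subscribe func(ctx context.Context) error // stand-in for the Subscribe call
	retry     time.Duration                   // pause between attempts
	cancel    context.CancelFunc
}

// Init does not fail on an unreachable gateway; it only starts the
// background goroutine that keeps retrying.
func (p *plugin) Init() error {
	ctx, cancel := context.WithCancel(context.Background())
	p.cancel = cancel
	go func() {
		for {
			if err := p.subscribe(ctx); err == nil {
				return // subscribed; event handling would continue from here
			}
			select {
			case <-ctx.Done():
				return
			case <-time.After(p.retry):
			}
		}
	}()
	return nil
}

// Destroy stops the retry goroutine.
func (p *plugin) Destroy() { p.cancel() }

// initWithUnavailableGateway reports whether Init succeeds and at least
// `want` subscription attempts are observed while the gateway stays down.
func initWithUnavailableGateway(want int) bool {
	attempts := make(chan struct{}, want)
	p := &plugin{
		retry: time.Millisecond,
		subscribe: func(context.Context) error {
			select {
			case attempts <- struct{}{}:
			default: // buffer full, enough attempts already recorded
			}
			return errors.New("gateway unavailable")
		},
	}
	if err := p.Init(); err != nil {
		return false
	}
	defer p.Destroy()

	timeout := time.After(2 * time.Second)
	for i := 0; i < want; i++ {
		select {
		case <-attempts:
		case <-timeout:
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("init ok and retrying:", initWithUnavailableGateway(3))
}
```

In a real test file the same assertions would live in a `func TestXxx(t *testing.T)` and report failures through `t.Fatal` instead of a boolean.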

Impact

This fix ensures the DCS plugin reconnects to the ecs-dcs-gateway once the gateway becomes available, preventing a repeat of the production issue seen when the gateway was unavailable during startup. The plugin now handles network outages and gateway restarts without requiring an AliECS core restart.

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • esm.ubuntu.com
    • Triggering command: /usr/lib/apt/methods/https (dns block)


Copilot AI and others added 2 commits August 14, 2025 09:07
Co-authored-by: knopers8 <14327588+knopers8@users.noreply.github.com>
- Move subscription logic into goroutine to enable continuous retry
- Always start reconnection goroutine regardless of initial connection state
- Plugin initialization now succeeds even if DCS gateway is unavailable
- Add comprehensive test for unavailable gateway scenario
- Resolves issue where plugin never attempted reconnection after init failure

Co-authored-by: knopers8 <14327588+knopers8@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Attempt reconnecting to ecs-dcs gateway at core startup. The DCS plugin located in core/integration/dcs attempts to reconnect to the Subscribe stream only if it has been already established during the plugin lifetime. However, if ecs-dcs-gateway is un..." to "Fix DCS plugin reconnection during core startup when gateway is unavailable" Aug 14, 2025
Copilot AI requested a review from knopers8 August 14, 2025 09:16
@knopers8
Collaborator

@copilot remove the proto-generated go files from the PR. They have nothing to do with the aim of the PR.

Co-authored-by: knopers8 <14327588+knopers8@users.noreply.github.com>
Contributor Author

Copilot AI commented Aug 14, 2025

@copilot remove the proto-generated go files from the PR. They have nothing to do with the aim of the PR.

Removed all proto-generated files from the PR. The changes now only include the DCS plugin fix (core/integration/dcs/plugin.go and core/integration/dcs/plugin_reconnect_test.go). Commit: 22d0313

Collaborator

@knopers8 knopers8 left a comment

Well, that actually looks correct and works well (tested on staging)

@knopers8 knopers8 marked this pull request as ready for review August 18, 2025 16:15
@knopers8 knopers8 requested a review from justonedev1 as a code owner August 18, 2025 16:15
@knopers8 knopers8 changed the title from "Fix DCS plugin reconnection during core startup when gateway is unavailable" to "OCTRL-1008 Attempt reconnecting to ecs-dcs gateway at core startup" Aug 18, 2025
@knopers8
Collaborator

The commits should be squashed when merging.

@knopers8 knopers8 merged commit 5b31372 into master Aug 19, 2025
4 checks passed
@knopers8 knopers8 deleted the copilot/fix-4afbbf43-18fe-4394-a391-ca2dc4cd694b branch August 19, 2025 07:33

3 participants