@s4heid (Contributor) commented Dec 1, 2025

  • Run first-boot tasks via systemd so sshd never races with host-key regeneration. The old rc.local script ran after network.target, but in parallel with other regular system services, like ssh.service. Therefore, ssh.service often started (and restarted) while /root/firstboot.sh was deleting keys. cloud-init’s set-passwords module made this worse by restarting ssh mid-run.
  • Replace rc.local with a oneshot firstboot.service (delete keys, create new keys, reconfigure sysstat) that runs Before=ssh.service and leaves the /root/firstboot_done file as a marker.
  • Add a cloud-config.service drop-in so cloud-init's config stage waits for firstboot.service, and
  • Update walinuxagent.service to wait for firstboot.service, ensuring ssh keys have been regenerated. This guarantees sshd, cloud-init, and WALinuxAgent all start only after the first-boot tasks succeed.
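
The drop-in's contents are not shown in this conversation. As a minimal sketch, assuming it only needs an ordering and requirement dependency on firstboot.service, it could be installed like this. The helper name `write_firstboot_dropin` and the `Requires=`/`After=` lines are assumptions; the drop-in path matches the file added by this PR.

```shell
#!/bin/sh
# Hypothetical helper: writes a cloud-config.service drop-in under the given
# root directory (for example a stemcell chroot).
write_firstboot_dropin() {
  root="$1"
  dir="$root/etc/systemd/system/cloud-config.service.d"
  mkdir -p "$dir" || return 1
  # Assumed contents: After= orders cloud-config.service behind
  # firstboot.service, Requires= pulls firstboot.service in and ties
  # cloud-config.service to its successful start.
  cat > "$dir/firstboot-blocker.conf" <<'EOF'
[Unit]
Requires=firstboot.service
After=firstboot.service
EOF
}
```

Calling `write_firstboot_dropin "$chroot"` (with `$chroot` being whatever root the build stage targets) leaves the unit file shipped by cloud-init untouched, which is the usual reason to prefer a drop-in over editing the unit.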

Resolves #458

@s4heid (Contributor, Author) commented Dec 14, 2025

Warning

This change could affect how the ssh service behaves. If the firstboot script was intended only for host key regeneration, running ssh-keygen -A should be sufficient.
However, if we want to keep using dpkg-reconfigure, it is harder to ensure that firstboot.service runs before ssh.service: dpkg-reconfigure attempts to restart ssh.service, and that restart would deadlock because ssh.service is ordered after the still-running firstboot.service. One possible solution would be to do something like this:

[Service]
Type=oneshot
ExecStartPre=/bin/sh -c '{ echo \"#!/bin/sh\nexit 101\"; } > /usr/sbin/policy-rc.d && chmod +x /usr/sbin/policy-rc.d'
ExecStartPre=/bin/sh -c '/bin/rm -f /etc/ssh/ssh_host*key*'
ExecStart=/usr/sbin/dpkg-reconfigure -fnoninteractive -pcritical openssh-server
ExecStart=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
ExecStartPost=/bin/rm -f /usr/sbin/policy-rc.d
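
The same guard could also live in a small script rather than in inline ExecStartPre/ExecStartPost lines. The following is a sketch under assumptions: the helper name `with_service_starts_blocked` and the overridable `POLICY_FILE` variable are hypothetical and not part of this PR. A policy-rc.d that exits with 101 tells invoke-rc.d that the requested action is forbidden by policy, so the package scripts skip the service restart.

```shell
#!/bin/sh
# Sketch: run a command while a temporary policy-rc.d forbids service
# (re)starts. POLICY_FILE is a hypothetical override for dry runs; the real
# location is /usr/sbin/policy-rc.d.
POLICY_FILE="${POLICY_FILE:-/usr/sbin/policy-rc.d}"

with_service_starts_blocked() {
  # Exit code 101 means "action forbidden by policy".
  printf '#!/bin/sh\nexit 101\n' > "$POLICY_FILE" || return 1
  chmod +x "$POLICY_FILE" || return 1
  "$@"
  status=$?
  # Always remove the policy file so later package operations behave normally.
  rm -f "$POLICY_FILE"
  return "$status"
}
```

With this helper, the reconfigure step would read `with_service_starts_blocked dpkg-reconfigure -fnoninteractive -pcritical openssh-server`. Unlike a separate ExecStartPost cleanup, the policy file is removed even when the wrapped command fails.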

@s4heid s4heid marked this pull request as ready for review December 14, 2025 17:46
@rkoster rkoster requested review from a team, Copilot, lnguyen and ystros and removed request for a team December 18, 2025 16:11
@rkoster rkoster moved this from Inbox to Pending Review | Discussion in Foundational Infrastructure Working Group Dec 18, 2025
@rkoster rkoster requested a review from ramonskie December 18, 2025 16:13
@ramonskie (Contributor) commented

We should not introduce this in jammy, as it will break warden stemcells: if the host does not have systemd, the VM with the jammy stemcell will fail.

We currently have similar issues on noble as well, since we have set bosh-agent to use systemd.
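
One way a first-boot script could tell the two environments apart is to check whether systemd is the running init. This is only a sketch of such a guard, not something proposed in this PR; the function name `systemd_is_running` and its optional directory argument are hypothetical. The directory /run/systemd/system exists only when the system was booted with systemd, which is the same check sd_booted(3) performs.

```shell
#!/bin/sh
# Sketch: succeed only when systemd is the running init system.
# The optional argument is a hypothetical override used for dry runs.
systemd_is_running() {
  [ -d "${1:-/run/systemd/system}" ]
}
```

A caller could then fall back to a non-systemd code path with `systemd_is_running || run_legacy_firstboot`, where `run_legacy_firstboot` stands in for whatever the warden case needs.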

Copilot AI left a comment

Pull request overview

This PR fixes a race condition in which the SSH daemon could start before host keys were regenerated during first boot, causing provisioning failures. The fix replaces the rc.local-based firstboot mechanism with a systemd service that establishes explicit ordering dependencies.

Key Changes

  • Introduces firstboot.service (oneshot systemd unit) that runs before ssh.service to regenerate host keys and configure sysstat
  • Removes the legacy rc.local script and firstboot.sh in favor of systemd-native orchestration
  • Updates walinuxagent.service to depend on firstboot.service completion instead of polling for the marker file

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| stemcell_builder/stages/base_ubuntu_firstboot/assets/etc/systemd/system/firstboot.service | New systemd oneshot service that deletes old SSH keys, generates new ones, and reconfigures sysstat before SSH starts |
| stemcell_builder/stages/base_ubuntu_firstboot/assets/etc/rc.local | Removed legacy rc.local script that previously executed firstboot tasks |
| stemcell_builder/stages/base_ubuntu_firstboot/assets/root/firstboot.sh | Removed shell script containing firstboot logic, now handled by systemd service |
| stemcell_builder/stages/base_ubuntu_firstboot/apply.sh | Updated to install and enable the new firstboot.service instead of copying rc.local and firstboot.sh scripts |
| stemcell_builder/stages/system_azure_init/assets/etc/systemd/system/cloud-config.service.d/firstboot-blocker.conf | New drop-in configuration ensuring cloud-init waits for firstboot.service completion |
| stemcell_builder/stages/system_azure_init/assets/etc/waagent/walinuxagent.service | Replaced polling loop with proper systemd dependency on firstboot.service |
| stemcell_builder/stages/system_azure_init/apply.sh | Added installation of cloud-config.service drop-in to ensure proper ordering |
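
For contrast with the systemd dependency, a marker-file polling wait of the kind the walinuxagent.service row describes might look like the sketch below. The removed loop itself is not shown in this conversation, so this shape is an assumption, and `wait_for_marker` and its retry budget are hypothetical names.

```shell
#!/bin/sh
# Sketch of a marker-file polling wait (assumed shape, not the removed code).
# $1: marker path, $2: number of one-second retries (hypothetical default 60).
wait_for_marker() {
  marker="$1"
  tries="${2:-60}"
  while [ ! -e "$marker" ]; do
    # Give up once the retry budget is spent.
    [ "$tries" -gt 0 ] || return 1
    tries=$((tries - 1))
    sleep 1
  done
}
```

A loop like this only observes the marker file, so it needs its own timeout and cannot tell a slow first boot from a failed one. Ordering the unit after firstboot.service hands both concerns to systemd.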

[Unit]
Description=Run first boot tasks
ConditionPathExists=!/root/firstboot_done
Before=ssh.service

Copilot AI Dec 18, 2025

Consider adding "Before=sshd-keygen.service" to ensure firstboot.service completes key regeneration before the system's default SSH key generation service (sshd-keygen.service) attempts to run. While both services likely check for existing keys, explicitly ordering them would prevent any potential race condition between the two key generation mechanisms.

Suggested change
- Before=ssh.service
+ Before=sshd-keygen.service ssh.service

Comment on lines +9 to +13
ExecStart=/usr/bin/ssh-keygen -A -v
ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
ExecStartPost=/usr/bin/touch /root/firstboot_done
RemainAfterExit=yes

Copilot AI Dec 18, 2025

If the ssh-keygen command in ExecStart fails, the ExecStartPost commands (including the creation of the marker file) will still execute. This could leave the system in an inconsistent state where firstboot_done exists but SSH keys were never generated. Consider moving the marker file creation to a separate ExecStartPost command that only runs on success, or use a more robust approach to ensure the service fails if key generation fails.

Suggested change
- ExecStart=/usr/bin/ssh-keygen -A -v
- ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
- ExecStartPost=/usr/bin/touch /root/firstboot_done
- RemainAfterExit=yes
+ ExecStart=/bin/sh -c '/usr/bin/ssh-keygen -A -v && /usr/bin/touch /root/firstboot_done'
+ ExecStartPost=/usr/sbin/dpkg-reconfigure -fnoninteractive sysstat
+ RemainAfterExit=yes
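
The `&&` chaining in the suggested ExecStart line can be tried in isolation. The sketch below is a hypothetical harness, not part of the PR: `regenerate_keys_then_mark`, `KEYGEN_CMD`, and `MARKER` are made-up names that allow a dry run without touching real host keys.

```shell
#!/bin/sh
# Sketch: create the marker only when key generation succeeds.
# MARKER and KEYGEN_CMD are hypothetical overrides for dry runs; the defaults
# mirror the paths used in the suggested change.
MARKER="${MARKER:-/root/firstboot_done}"
KEYGEN_CMD="${KEYGEN_CMD:-/usr/bin/ssh-keygen -A -v}"

regenerate_keys_then_mark() {
  # Word splitting of $KEYGEN_CMD is intentional so flags are passed through.
  # touch runs only if the key generation command exits with status 0.
  $KEYGEN_CMD && touch "$MARKER"
}
```

Setting `KEYGEN_CMD=false` and a temporary `MARKER` path shows that the function returns non-zero and writes no marker when key generation fails.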

Development

Successfully merging this pull request may close these issues.

ssh.service failures on azure stemcells