@yashanand1910 yashanand1910 commented Jan 24, 2025

This PR upgrades the existing code for Cedana C/R to use the new API. Cedana has a completely redesigned architecture and will soon reach v1.0. Existing features should work as is.

The following changes/improvements are in this PR:

  1. Removed the extraneous logic that modified the spec to add GPU C/R support; it is no longer needed.
  2. Removed the cedanaClient.Manage call after spawning a container. Now, if a runc container needs C/R support, it can be spawned directly through Cedana with the same runc.CreateOpts. In fact, it should eventually be possible to eliminate the existing runc client and use only Cedana, since it provides an abstraction (called a job) for spawning a workload on any runtime (simple process, runc, kata, containerd, etc.), which makes runtimes easier to replace.
  3. Fixed new container ID not being used on restore.
  4. Removed the constraint of using a consoleWriter, which required the runc container to be spawned with spec.Process.Terminal = true. The old, simple outputWriter is used instead.
  5. Simplified the client code in cedana.go. It now uses the friendly Client package exported by Cedana.
  6. CRIU and the support for runc and GPU are bundled as individual 'plugins' to Cedana. Dockerfile.worker was simplified to download these.
  7. Many more config options (for default behaviour):
    cedana:
      # Can be unix, tcp, vsock
      protocol: unix
      address: /run/cedana.sock
      log_level: debug
      # Connection details to your cedana endpoint
      connection:
        url: ""
        auth_token: ""
      checkpoint:
        # Default dir to write/stream checkpoints to
        dir: /data/checkpoints
        # Can be one of: none, tar, lz4, gzip, zlib
        compression: lz4
        # Number of parallel streams for streaming checkpoints (0 = off)
        stream: 4
      db:
        # Use remote DB on the cedana endpoint
        remote: true
      profiling:
        # Receive profiling info in the gRPC trailer
        enabled: false
        precision: auto
      criu:
        # Keep the job running after checkpoint
        leave_running: true

For the complete set of options, see https://github.com/cedana/cedana/blob/bfd99a2c7ec2c611adbf4c1697e4e811344f88dd/pkg/config/types.go
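For readers who want to see how a subset of these options might be modelled and sanity-checked on the consumer side, here is a minimal sketch in Go. The type and function names (`Config`, `Checkpoint`, `DefaultConfig`, `Validate`) are hypothetical and are not taken from Cedana's `pkg/config/types.go`; only the keys, defaults, and allowed values come from the YAML above. Rejecting a negative `stream` value is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Checkpoint mirrors the `checkpoint` keys from the YAML above.
// Hypothetical type: not Cedana's actual config struct.
type Checkpoint struct {
	Dir         string // default dir to write/stream checkpoints to
	Compression string // one of: none, tar, lz4, gzip, zlib
	Stream      int    // number of parallel streams (0 = off)
}

// Config mirrors a subset of the top-level `cedana` keys.
// Hypothetical type: not Cedana's actual config struct.
type Config struct {
	Protocol   string // one of: unix, tcp, vsock
	Address    string
	Checkpoint Checkpoint
}

// DefaultConfig returns the values listed in the PR description.
func DefaultConfig() Config {
	return Config{
		Protocol: "unix",
		Address:  "/run/cedana.sock",
		Checkpoint: Checkpoint{
			Dir:         "/data/checkpoints",
			Compression: "lz4",
			Stream:      4,
		},
	}
}

// Validate checks the enumerated values documented in the YAML comments.
func (c Config) Validate() error {
	switch c.Protocol {
	case "unix", "tcp", "vsock":
	default:
		return fmt.Errorf("unknown protocol %q", c.Protocol)
	}
	switch c.Checkpoint.Compression {
	case "none", "tar", "lz4", "gzip", "zlib":
	default:
		return fmt.Errorf("unknown compression %q", c.Checkpoint.Compression)
	}
	// Assumption: a negative stream count is treated as invalid.
	if c.Checkpoint.Stream < 0 {
		return errors.New("stream must be >= 0 (0 = off)")
	}
	return nil
}

func main() {
	cfg := DefaultConfig()
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("config ok:", cfg.Protocol, cfg.Address)
}
```

Validating the enumerated fields up front turns a typo such as `compression: lz5` into an immediate, readable error instead of a failure at checkpoint time.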

Upcoming improvements exploiting new features of Cedana:

  1. Streaming checkpoints (for both checkpoint and restore) directly to/from the filesystem with on-the-fly compression, eliminating extra filesystem reads/writes. Streaming also makes it possible to stream compressed checkpoints directly to remote storage. We are still testing GPU C/R with this and expect it to be ready by next week. Early results show a speedup that grows linearly with the number of parallel streams used.
  2. Run the Cedana daemon as a DaemonSet on the host instead of spawning one per worker. The daemon is designed for Kubernetes, where it can handle several concurrent C/R requests.

@yashanand1910 yashanand1910 marked this pull request as ready for review January 28, 2025 23:04