169 changes: 93 additions & 76 deletions fusion_docs/troubleshooting/fusion-snapshots.md
@@ -28,13 +28,14 @@ To resolve this issue:
- Lower memory requested by tasks
- Process smaller data chunks
- Set `process.resourceLimits` to enforce limits:

```groovy
// AWS Batch example
process.resourceLimits = [cpus: 32, memory: '60.GB']

// Google Batch example (more conservative for 30s window)
process.resourceLimits = [cpus: 16, memory: '20.GB']
```

1. Increase network bandwidth:

@@ -47,18 +48,19 @@ To resolve this issue:
- Avoid ARM64 instances if checkpoints are failing. To pin an `x86_64` machine type on Google Batch, see the sketch below.

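If the workload runs on Google Batch, one way to pin an `x86_64` machine is the `machineType` directive — a minimal sketch with an illustrative machine type (on AWS Batch, instance families are selected in the compute environment instead):

```groovy
// Pin an x86_64 machine type for all processes (Google Batch executor).
// 'n2-standard-16' is an illustrative choice, not a recommendation.
process.machineType = 'n2-standard-16'
```
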
1. Configure retry strategy:

```groovy
process {
    maxRetries = 2
    errorStrategy = {
        // Exit status 175 indicates a Fusion Snapshots checkpoint that failed or timed out.
        if (task.exitStatus == 175) {
            return 'retry'
        } else {
            return 'terminate'
        }
    }
}
```

See [AWS Batch instance selection](../guide/snapshots/aws#selecting-an-ec2-instance) or [Google Batch best practices](../guide/snapshots/gcp) for recommended configurations.

@@ -80,10 +82,12 @@ This issue can occur due to:
To resolve this issue:

1. Check if previous checkpoint completed:

- Review logs for "Dumping finished successfully".
- If the "Dumping finished successfully" message is missing, the previous checkpoint timed out and the task failed with exit code `175`.

1. Verify checkpoint data exists:

- Check that the `.fusion/dump/` work directory contains checkpoint files.
- Ensure that the S3/GCS bucket is accessible.
- If the bucket is missing, open a support ticket. See [Getting help](#getting-help) for more information.
@@ -109,11 +113,13 @@ This issue can occur due to:
To resolve this issue:

1. For AWS Batch (120-second window):

- Use instances with a 5:1 or better memory-to-bandwidth ratio (see the back-of-envelope check after this list).
- Use `x86_64` instances for incremental snapshot support (`c6id`, `m6id`, `r6id` families).
- Check the instance architecture with `uname -m`.

1. For Google Batch (30-second window):

- Use `x86_64` instances (mandatory for larger workloads).
- Use more conservative memory limits.
- Consider smaller instance types with better ratios.
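As a back-of-envelope check (assumption: the full task memory must be written out within the reclamation window, with the network as the bottleneck), estimate the dump time as memory divided by bandwidth:

```groovy
// Rough dump-time estimate; the numbers are illustrative, not measured.
def memGiB    = 60          // task memory to checkpoint
def bwGiBps   = 12.5 / 8    // a 12.5 Gbps NIC moves ~1.56 GiB/s
def windowSec = 120         // AWS Batch reclamation window (30 for Google Batch)
def dumpSec   = memGiB / bwGiBps
println "Estimated dump time: ${Math.round(dumpSec)} s of ${windowSec} s" // ≈ 38 s — fits
```
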
@@ -137,21 +143,24 @@ This issue can occur due to:
To resolve this issue:

1. Split large tasks:

- Break into smaller, checkpointable units.
- Process data in chunks (see the sketch below).

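A minimal sketch of chunked processing, assuming newline-delimited input, a hypothetical S3 path, and a hypothetical `PROCESS_CHUNK` process — `splitText` with `file: true` emits each chunk as a file, so each task checkpoints a smaller memory footprint:

```groovy
// Hypothetical process that handles one chunk at a time.
process PROCESS_CHUNK {
    input:
    path chunk

    script:
    "wc -l ${chunk}"
}

workflow {
    def chunks = Channel
        .fromPath('s3://my-bucket/big-input.txt')  // hypothetical input
        .splitText(by: 1_000_000, file: true)      // ~1M lines per chunk file
    PROCESS_CHUNK(chunks)
}
```
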
1. Switch to `x86_64` instances:

- Essential for Google Batch.
- Recommended for AWS Batch tasks > 40 GiB.

1. Adjust memory limits:

```groovy
// For AWS Batch
process.resourceLimits = [cpus: 32, memory: '60.GB']

// For Google Batch (more conservative)
process.resourceLimits = [cpus: 16, memory: '20.GB']
```

## SSL/TLS connection errors after restore

@@ -160,6 +169,7 @@ Applications fail after restore with connection errors, especially HTTPS connect
This issue occurs when applications use HTTPS connections, as CRIU cannot preserve encrypted TCP connections (SSL/TLS).

To resolve this issue, configure TCP close mode to drop connections during checkpoint:

```groovy
process.containerOptions = '-e FUSION_SNAPSHOTS_TCP_MODE=close'
```
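With `close` mode, open TCP connections are simply dropped at checkpoint time, so after restore the application has to re-establish its SSL/TLS sessions; this generally assumes the application or its HTTP client library retries failed connections.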
@@ -180,64 +190,71 @@ To diagnose checkpoint problems:

- Check `.command.log` in the task work directory for Fusion Snapshots messages (prefixed with timestamps).

:::tip
Enable `debug` logging for more details.

```groovy
process.containerOptions = '-e FUSION_SNAPSHOT_LOG_LEVEL=debug'
```
:::

1. Inspect your checkpoint data:

1. Open the `.fusion/dump/` folder:

```console
.fusion/dump/
├── 1/                # First dump
│   ├── pre_*.log     # Pre-dump log (if incremental)
│   └── <CRIU files>
├── 2/                # Second dump
│   ├── pre_*.log
│   └── <CRIU files>
├── 3/                # Third dump (full)
│   ├── dump_*.log    # Full dump log
│   ├── restore_*.log # Restore log (if restored)
│   └── <CRIU files>
└── dump_metadata     # Metadata tracking all dumps
```

1. For incremental dumps (PRE type), check for success markers at the end of the `pre_*.log` file:

```console
(66.525687) page-pipe: Killing page pipe
(66.563939) irmap: Running irmap pre-dump
(66.610871) Writing stats
(66.658902) Pre-dumping finished successfully
```

1. For full dumps (FULL type), check for success markers at the end of the `dump_*.log` file:

```console
(25.867099) Unseizing 90 into 2
(27.160829) Writing stats
(27.197458) Dumping finished successfully
```

1. If the log ends abruptly without a success message, check the last timestamp:

```console
(121.37535) Dumping path for 329 fd via self 353 [/path/to/file.tmp]
(121.65146) 90 fdinfo 330: pos: 0x4380000 flags: 100000/0
# Log truncated - instance was reclaimed before dump completed
```

- AWS Batch: Timestamps near 120 seconds indicate the instance was terminated during the dump.
- Google Batch: Timestamps near 30 seconds indicate the instance was terminated during the dump.

Cause: Task memory is too large, or bandwidth is too low, for the reclamation window.

1. For restore operations, check for a success marker at the end of the `restore_*.log` file:

```console
(145.81974) Running pre-resume scripts
(145.81994) Restore finished successfully. Tasks resumed.
(145.82001) Writing stats
```

1. Verify your configuration:

@@ -250,8 +267,8 @@ To diagnose checkpoint problems:

1. Test with different instance types. If uncertain:

- Run the same task on instance types with better disk IOPS and bandwidth guarantees, and verify whether Fusion Snapshots works there (one option for varying the machine type per retry is sketched below).
- Decrease memory usage to a manageable amount.
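A minimal sketch of cycling machine types across retries, assuming Google Batch and purely illustrative `n2`/`c2` machine types (most Nextflow directives, including `machineType`, accept a closure evaluated per task attempt):

```groovy
process {
    maxRetries = 2
    errorStrategy = 'retry'
    // Hypothetical machine types: try a different one on the second attempt.
    machineType = { task.attempt == 1 ? 'n2-standard-16' : 'c2-standard-16' }
}
```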

:::tip
For detailed information about error codes and logging, see [Error reference](./error-codes-exit-messages).