Skip to content

iops split#665

Draft
boddumanohar wants to merge 1 commit intodevfrom
iops-split
Draft

iops split#665
boddumanohar wants to merge 1 commit intodevfrom
iops-split

Conversation

@boddumanohar
Copy link
Member

@boddumanohar boddumanohar commented Mar 16, 2026

fixes: simplyblock/vela#256

PR: simplyblock/simplyblock-csi#294 allows taking QOS IOPS parameters Via annotation. this PR makes use of it.

PTIR: split the assigned IOPS and provide 25% for the WAL and 75% for the data volume

testing

Created a branch with PITR enabled and IOPS as 1000 (which will create create 2 PVCs)

kubectl -n vela-pr665-01kmdjjheakq8dmd46vjdqsgm8 get pvc
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    VOLUMEATTRIBUTESCLASS   AGE
vela-autoscaler-vm-block-data   Bound    pvc-2182896e-0032-43ac-b63e-879a0e9e323f   1Gi        RWX            sc-01kmdjjheakq8dmd46vjdqsgm8   <unset>                 105s
vela-autoscaler-vm-pg-wal       Bound    pvc-2680739c-de3c-458c-bd84-95c2bf3e3150   100Gi      RWX            sc-01kmdjjheakq8dmd46vjdqsgm8   <unset>                 105s

for DATA PVC

sbctl lvol get pvc-2182896e-0032-43ac-b63e-879a0e9e323f | grep rw_ios_per_sec
| rw_ios_per_sec           | 750                                                                    |

and for WAL PVC

sbctl lvol get pvc-2680739c-de3c-458c-bd84-95c2bf3e3150 | grep rw_ios_per_sec
| rw_ios_per_sec           | 250                                                                    |

@boddumanohar boddumanohar requested a review from mxsrc March 17, 2026 04:54
@boddumanohar boddumanohar force-pushed the iops-split branch 2 times, most recently from a0d1273 to 409e5d5 Compare March 19, 2026 06:55
Base automatically changed from dev to main March 19, 2026 14:34
@boddumanohar boddumanohar changed the base branch from main to dev March 22, 2026 18:28
@boddumanohar boddumanohar force-pushed the iops-split branch 2 times, most recently from 7247ac6 to 060a346 Compare March 23, 2026 14:06
@boddumanohar
Copy link
Member Author

@mxsrc This PR is ready for review. Also have added testing results in the PR.
The changes on the CSI driver are already merged and are deployed to dev environment.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements PITR-aware IOPS allocation by splitting the requested branch IOPS budget between the main data volume and the WAL volume (25% WAL / 75% data), leveraging Simplyblock CSI’s per-PVC QOS annotations.

Changes:

  • Add support for rendering user-provided annotations onto the WAL PVC Helm template.
  • Introduce WAL_IOPS_FRACTION and apply IOPS splitting during branch provisioning/config generation and during volume IOPS updates.
  • Ensure clone/restore flows create the StorageClass with the (intended) data-volume IOPS portion.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
src/deployment/charts/vela/templates/autoscaler/wal-pvc.yaml Allows WAL PVC annotations to be rendered from chart values (needed for per-PVC QOS).
src/deployment/init.py Splits IOPS into data/WAL portions; annotates WAL PVC values; updates Simplyblock volumes accordingly.
src/api/organization/project/branch/init.py Adjusts clone/restore StorageClass creation to use the data portion of the IOPS budget.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +359 to +360
data_iops = max(1, round(iops * (1 - WAL_IOPS_FRACTION)))
wal_iops = max(1, round(iops * WAL_IOPS_FRACTION))
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update_branch_volume_iops always splits the requested IOPS into data/WAL and applies the reduced data_iops even when a WAL PVC/volume does not exist (e.g., PITR disabled). This will effectively drop 25% of the user-requested IOPS on non-PITR branches. Consider only splitting when the WAL PVC/volume is present (or when PITR is enabled), otherwise apply the full iops to the data volume.

Copilot uses AI. Check for mistakes.
Comment on lines +373 to 382
try:
async with create_simplyblock_api() as sb_api:
await sb_api.update_volume(volume=wal_volume, payload={"max_rw_iops": wal_iops})
except VelaSimplyblockAPIError as exc:
raise VelaDeploymentError("Failed to update WAL volume") from exc
logger.info("Updated Simplyblock WAL volume %s IOPS to %s", wal_volume, wal_iops)
except VelaDeploymentError as exc:
logger.info("WAL volume not found for branch %s; skipping WAL IOPS update: %s", branch_id, exc)


Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WAL update error handling is incorrect here: (1) WAL PVC lookup failures from Kubernetes raise VelaKubernetesError, not VelaDeploymentError, so missing WAL PVCs won’t be skipped as intended; and (2) raising VelaDeploymentError("Failed to update WAL volume") is also caught by this outer handler and logged as “WAL volume not found”, silently masking real update failures. Restructure this block to only swallow the specific “PVC not found” case and propagate/raise on actual update errors.

Suggested change
try:
async with create_simplyblock_api() as sb_api:
await sb_api.update_volume(volume=wal_volume, payload={"max_rw_iops": wal_iops})
except VelaSimplyblockAPIError as exc:
raise VelaDeploymentError("Failed to update WAL volume") from exc
logger.info("Updated Simplyblock WAL volume %s IOPS to %s", wal_volume, wal_iops)
except VelaDeploymentError as exc:
logger.info("WAL volume not found for branch %s; skipping WAL IOPS update: %s", branch_id, exc)
except VelaKubernetesError as exc:
# Only swallow the specific "PVC not found" case; propagate other errors.
cause = exc.__cause__
if isinstance(cause, ApiException) and getattr(cause, "status", None) == 404:
logger.info(
"WAL volume PVC not found for branch %s; skipping WAL IOPS update: %s",
branch_id,
exc,
)
return
raise
try:
async with create_simplyblock_api() as sb_api:
await sb_api.update_volume(volume=wal_volume, payload={"max_rw_iops": wal_iops})
except VelaSimplyblockAPIError as exc:
raise VelaDeploymentError("Failed to update WAL volume") from exc
logger.info("Updated Simplyblock WAL volume %s IOPS to %s", wal_volume, wal_iops)

Copilot uses AI. Check for mistakes.
Comment on lines +541 to +542
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
wal_iops = max(1, round(parameters.iops * WAL_IOPS_FRACTION))
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_vela_config computes data_iops/wal_iops and creates the branch StorageClass with only data_iops regardless of pitr_enabled. For non-PITR branches (no WAL PVC), this will provision the main volume(s) at ~75% of the requested IOPS. Make the split conditional on pitr_enabled (or presence of WAL), otherwise use parameters.iops as-is.

Suggested change
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
wal_iops = max(1, round(parameters.iops * WAL_IOPS_FRACTION))
if pitr_enabled:
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
wal_iops = max(1, round(parameters.iops * WAL_IOPS_FRACTION))
else:
# For non-PITR branches (no WAL PVC), provision the main volume(s)
# with the full requested IOPS instead of a reduced "data_iops".
data_iops = parameters.iops
wal_iops = max(1, round(parameters.iops * WAL_IOPS_FRACTION))

Copilot uses AI. Check for mistakes.
wal_persistence["create"] = not use_existing_db_pvc
wal_persistence["size"] = PITR_WAL_PVC_SIZE
wal_persistence["storageClassName"] = storage_class_name
wal_persistence["annotations"] = {"simplybk/qos-rw-iops": str(wal_iops)}
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wal_persistence["annotations"] is assigned a new dict, which will discard any existing WAL PVC annotations coming from chart defaults or user overrides. Merge into the existing annotations mapping (e.g., setdefault + update) so other annotations aren’t lost.

Suggested change
wal_persistence["annotations"] = {"simplybk/qos-rw-iops": str(wal_iops)}
wal_annotations = wal_persistence.setdefault("annotations", {})
wal_annotations["simplybk/qos-rw-iops"] = str(wal_iops)

Copilot uses AI. Check for mistakes.
Comment on lines +873 to +874
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
storage_class_name = await ensure_branch_storage_class(branch_id, iops=data_iops)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code reduces StorageClass IOPS to data_iops unconditionally. Since this task has a pitr_enabled parameter, the IOPS split should be applied only when PITR/WAL is enabled; otherwise cloned/restored branches will get ~75% of the requested IOPS.

Suggested change
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
storage_class_name = await ensure_branch_storage_class(branch_id, iops=data_iops)
if pitr_enabled:
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
storage_class_name = await ensure_branch_storage_class(branch_id, iops=data_iops)
else:
storage_class_name = await ensure_branch_storage_class(branch_id, iops=parameters.iops)

Copilot uses AI. Check for mistakes.
storage_class_name: str | None = None
try:
storage_class_name = await ensure_branch_storage_class(branch_id, iops=parameters.iops)
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageClass is created with data_iops unconditionally here. If pitr_enabled is false for the target branch, this will under-provision IOPS by ~25% (no WAL volume to receive the remainder). Apply the split only when PITR/WAL is enabled.

Suggested change
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
if pitr_enabled:
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
else:
data_iops = parameters.iops

Copilot uses AI. Check for mistakes.

try:
storage_class_name = await ensure_branch_storage_class(branch_id, iops=parameters.iops)
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in other restore/clone paths, data_iops is derived from the total IOPS without checking whether PITR/WAL is enabled. If the branch does not have a WAL PVC, this will under-allocate IOPS. Consider conditioning this split on PITR/WAL being enabled (or on WAL PVC presence).

Suggested change
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
if recovery_target_time is not None:
# PITR/WAL is in use; reserve a fraction of total IOPS for WAL.
data_iops = max(1, round(parameters.iops * (1 - WAL_IOPS_FRACTION)))
else:
# No PITR/WAL expected; allocate all available IOPS to the data volume.
data_iops = max(1, parameters.iops)

Copilot uses AI. Check for mistakes.
@boddumanohar
Copy link
Member Author

moving this back to Draft.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants