Add clustered migration sync for shared disks (SYNCING barrier)#403
fabi200123 wants to merge 1 commit into cloudbase:master
Conversation
    shared disks. Once the owner replicate task completes, any
    waiting (SYNCING) replicate tasks are moved back to SCHEDULED
    so they can continue their normal flow.
    """
So as far as I understand this, this will basically block the rest of the cluster while the main one gets transferred. That is fine for shared disks, but what I proposed initially was to have them all running in parallel once the syncing tasks (which were not meant to be REPLICATE_DISKS, by the way) are done. You can add SYNCING on this task alone while waiting for the rest of the tasks to complete (tasks like DEPLOY_REPLICA_DISKS, SHUTDOWN_INSTANCE); that's also feasible, but please set the volumes_info accordingly instead of blocking all of them.
What I originally envisioned was something like this:
instance1 has: root_disk1, shared_disk1;
instance2 has: root_disk2, shared_disk1.
What's the point of blocking instance2 while instance1 is replicating, when you can have the following:
instance1 replicates root_disk1, shared_disk1;
instance2 replicates root_disk2 and skips shared_disk1, based on the volumes_info that you can set up before launching the replicate_disks task.
Please let me know if there's anything preventing us from doing a parallel sync, with shared_disks being transferred by only one of the instances.
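The parallel scheme sketched above boils down to assigning each shared disk a single owner before launching the replicate tasks. A minimal sketch of that ownership assignment, assuming a hypothetical `build_volumes_info` helper and a per-volume `replicate_disk_data` flag (neither is the actual Coriolis API; the flag name comes from this PR's description):

```python
def build_volumes_info(instances_disks):
    """Assign each shared disk to one owner instance.

    instances_disks: {instance_name: [disk_id, ...]}.
    Returns {instance_name: [{"disk_id": ..., "replicate_disk_data": bool}]}.
    A disk listed under several instances is shared; only the first
    instance to claim it (the owner) replicates its data, the rest skip it.
    """
    owners = {}
    volumes_info = {}
    for instance, disks in instances_disks.items():
        volumes_info[instance] = []
        for disk_id in disks:
            owner = owners.setdefault(disk_id, instance)
            volumes_info[instance].append({
                "disk_id": disk_id,
                # only the owner transfers the shared disk's data
                "replicate_disk_data": owner == instance,
            })
    return volumes_info
```

With the example above, instance1 would get `replicate_disk_data=True` for shared_disk1 and instance2 would get `False` for it, so both replicate tasks can run in parallel without duplicating the shared-disk transfer.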
Force-pushed from 9813d2b to 61ebc94
      constants.TASK_STATUS_PENDING,
-     constants.TASK_STATUS_STARTING):
+     constants.TASK_STATUS_STARTING,
+     constants.TASK_STATUS_SYNCING):
Wasn't this already handled by whatever _abort_peer_sync_barrier_tasks_on_error does? If one of the tasks hits an error, the rest of the tasks get cancelled, even if the error comes from another instance.
The idea here is to differentiate between an error in the first task, after which this one awaits
Introduce TASK_STATUS_SYNCING and TASK_TYPES_TO_SYNC (GET_INSTANCE_INFO, DEPLOY_TRANSFER_DISKS, REPLICATE_DISKS) so multi-instance transfers with base_transfer_action.clustered=True wait for all peer tasks of the same type before leaving SYNCING for COMPLETED and advancing dependents.
- clustered is set as len(instances) > 1 on transfer create
- On task_completed: enter SYNCING when the barrier applies; once every peer is SYNCING, run the sync hooks (GET_INSTANCE_INFO: promote shareable on export disks; DEPLOY_TRANSFER_DISKS: shared-disk volumes_info; REPLICATE_DISKS: sync change_id)
- ReplicateDisksTask: skip the provider replicate call when replicate_disk_data=False
- On task error: abort peers stuck in SYNCING for the same task type
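The barrier described above can be sketched roughly as follows. This is a simplified illustration, not the Coriolis implementation: the status constants and task-type set mirror the PR, but `on_task_completed` and the in-memory task dicts are hypothetical stand-ins for the real scheduler and database models:

```python
TASK_STATUS_SYNCING = "SYNCING"
TASK_STATUS_COMPLETED = "COMPLETED"
TASK_TYPES_TO_SYNC = {
    "GET_INSTANCE_INFO", "DEPLOY_TRANSFER_DISKS", "REPLICATE_DISKS"}

def on_task_completed(task, all_tasks, clustered):
    """Park a finished barrier task in SYNCING until every peer task of
    the same type reaches the barrier, then complete them all at once.

    all_tasks includes the task itself plus its peers on other instances.
    """
    if not clustered or task["type"] not in TASK_TYPES_TO_SYNC:
        # non-clustered transfers and non-barrier tasks complete directly
        task["status"] = TASK_STATUS_COMPLETED
        return
    task["status"] = TASK_STATUS_SYNCING
    same_type = [t for t in all_tasks if t["type"] == task["type"]]
    if all(t["status"] == TASK_STATUS_SYNCING for t in same_type):
        # last task to reach the barrier releases everyone; this is where
        # the per-type sync hooks (volumes_info, change_id, ...) would run
        for t in same_type:
            t["status"] = TASK_STATUS_COMPLETED
```

The last peer to arrive at the barrier is the one that flips everyone to COMPLETED, which is also the natural point to run the per-task-type sync hooks before dependents are scheduled.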
Introduce TASK_STATUS_SYNCING and TASK_TYPES_REQUIRING_CLUSTER_SYNC (DEPLOY_TRANSFER_DISKS, SHUTDOWN_INSTANCE) so multi-instance transfers with base_transfer_action.clustered=True wait for all peer tasks before marking COMPLETED and advancing dependents.
Volumes schema already allows extra properties; replicate_disk_data is consumed by replication only (default True preserves behavior).
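A minimal sketch of how a replicate loop might consume the flag, assuming a hypothetical `replicate_volumes` helper (only the `replicate_disk_data` key and its default of True come from the PR; everything else is illustrative):

```python
def replicate_volumes(volumes_info, replicate_fn):
    """Replicate each volume unless it carries replicate_disk_data=False.

    A missing flag defaults to True, preserving existing behavior for
    transfers that never set it (e.g. single-instance transfers).
    """
    replicated = []
    for volume in volumes_info:
        if volume.get("replicate_disk_data", True):
            replicated.append(replicate_fn(volume))
        # else: a peer instance owns this shared disk; skip the data copy
    return replicated
```

Because the schema tolerates extra properties, older volumes_info payloads without the flag pass through unchanged.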