
allocation checks: check branch.status instead of making call to k8s #427

Open
boddumanohar wants to merge 1 commit into main from resource-limits-check-fix

Conversation

@boddumanohar
Member

@boddumanohar boddumanohar commented Dec 12, 2025

Currently, during branch create we run checks against the provisioning limits. These checks take time because they iterate over all the branches and query k8s for resource information.

This is a very expensive operation, so I removed the _collect_branch_statuses function call and replaced it with logic that gets the same information from branch.status.

The tradeoff is that branch.status is around 1 minute old.
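Since branch.status is refreshed asynchronously (roughly once a minute), the staleness can be bounded explicitly. A minimal sketch of such a guard, with assumed field names rather than the actual model:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class BranchRecord:
    """Hypothetical shape of the persisted branch row; the real model
    lives in the ORM layer."""
    name: str
    status: str  # e.g. "RUNNING", "STOPPED", "DELETING"
    status_updated_at: datetime


# Assumed acceptable lag before falling back to a live query.
STALENESS_BUDGET = timedelta(minutes=2)


def usable_status(branch: BranchRecord,
                  now: Optional[datetime] = None) -> Optional[str]:
    """Return the cached status if it is fresh enough, else None to
    signal that the caller should fall back to querying k8s."""
    now = now or datetime.now(timezone.utc)
    if now - branch.status_updated_at <= STALENESS_BUDGET:
        return branch.status
    return None
```

This keeps the fast path (cached status) while making the fallback to a live check an explicit, observable event rather than a silent correctness gap.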

@boddumanohar boddumanohar force-pushed the resource-limits-check-fix branch from 9244f7d to ae12ecb Compare December 17, 2025 09:19
@mxsrc
Collaborator

mxsrc commented Dec 17, 2025

Do you have numbers on how long the current approach takes?

@boddumanohar
Member Author

Generally, the create-branch API takes around 10-15 seconds to return a 201 response.

@mxsrc
Collaborator

mxsrc commented Dec 17, 2025

Is all of that due to the provisioning check? On second thought, probably not. I suggest profiling what actually takes most of the time before optimizing parts and choosing trade-offs. This should be possible using py-spy in the deployed container (py-spy record -o profile.svg --pid 12345), or by editing the code:

import cProfile
import pstats
...
    profiler = cProfile.Profile()
    profiler.enable()
    ...  # code under investigation
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(20)

@boddumanohar
Member Author

boddumanohar commented Dec 17, 2025

If a project has 100 branches, the function _aggregate_group_by_resource_type calls _collect_branch_service_health, which probes every branch service for its health.

From the branch statuses, we exclude the branches with status BranchServiceStatus.STOPPED or BranchServiceStatus.DELETING:

_BRANCH_SERVICE_ENDPOINTS: dict[str, tuple[str, int]] = {
    "database": ("db", 5432),
    "pgbouncer": ("pgbouncer", 6432),
    "storage": ("storage", 5000),
    "meta": ("meta", 8080),
    "rest": ("rest", 3000),
    "pgexporter": ("pgexporter", 9187),
}

So during branch create, we make 100 × 6 = 600 network calls just for validation. This seems very costly.

@mxsrc
Collaborator

mxsrc commented Dec 17, 2025

I see the point. I think 100 branches is at the upper limit of what we can expect, and we should be able to parallelize this significantly. Still, it's a lot. Do you think it would be feasible to modify the pod to include a readiness/liveness probe? I'm trying to find a way where we don't have to implement everything ourselves but can use built-ins instead.

Collaborator

@noctarius noctarius left a comment


The comment should be brought up tomorrow morning in the standup. Otherwise, I think this PR is OK, since the branch status is already persisted in the database.

.join(Branch)
.where(
    Branch.project_id == project_id,
    not_(status_column.in_([BranchServiceStatus.STOPPED, BranchServiceStatus.DELETING])),

This is dangerous (and I guess it was also wrong in the old code). Stopped branches don't count towards CPU/RAM allocation, but I think storage (db, object) still counts, even when stopped. Only deleted branches are out of the game. WDYT @boddumanohar?
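One way to encode that distinction would be a per-resource exclusion set instead of a single global status filter. This is a sketch with hypothetical names, not the project's actual code:

```python
from enum import Enum


class BranchServiceStatus(str, Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    DELETING = "deleting"


# Assumption: stopped branches free CPU/RAM but still hold storage,
# so only DELETING branches are excluded when summing storage.
_EXCLUDED_BY_RESOURCE = {
    "cpu": {BranchServiceStatus.STOPPED, BranchServiceStatus.DELETING},
    "memory": {BranchServiceStatus.STOPPED, BranchServiceStatus.DELETING},
    "storage": {BranchServiceStatus.DELETING},
}


def counted_branches(branches: list[dict], resource_type: str) -> list[dict]:
    """Return the branches that count towards the given resource limit."""
    excluded = _EXCLUDED_BY_RESOURCE[resource_type]
    return [b for b in branches if b["status"] not in excluded]
```

The same mapping could drive the SQL filter shown above, with the excluded-status list chosen per resource type rather than hard-coded in the query.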

@boddumanohar boddumanohar force-pushed the resource-limits-check-fix branch from ae12ecb to 4b71d2a Compare February 7, 2026 02:09