Skip to content

feat: add fair_unified_task_shared memory pool to fix 2x memory allocation#3924

Open
andygrove wants to merge 5 commits intoapache:mainfrom
andygrove:task-shared-unified-pool
Open

feat: add fair_unified_task_shared memory pool to fix 2x memory allocation#3924
andygrove wants to merge 5 commits intoapache:mainfrom
andygrove:task-shared-unified-pool

Conversation

@andygrove
Copy link
Copy Markdown
Member

@andygrove andygrove commented Apr 10, 2026

Which issue does this PR close?

Closes #3921

Rationale for this change

When Comet executes a shuffle, it creates two separate native plans (the child plan and the shuffle writer plan) that run concurrently in a pipelined fashion. Previously, each plan got its own memory pool at the full per-task limit, effectively allowing 2x the intended memory to be consumed.

The new fair_unified_task_shared pool type shares a single CometFairMemoryPool across all native plans within the same Spark task. This ensures the total memory stays within the per-task limit while dynamically distributing memory among operators based on how many register as memory consumers.

This pool type is opt-in (not the default) to allow further testing and evaluation.

What changes are included in this PR?

  • Add fair_unified_task_shared memory pool type that shares a single pool across all native plans within the same Spark task, using a global HashMap<task_attempt_id, PerTaskMemoryPool> with reference counting
  • Fix a tracing bug where total_reserved_for_thread() and unregister_and_total() double-counted memory when multiple execution contexts shared the same pool Arc. Deduplicates by Arc data pointer before summing reserved()
  • Keep fair_unified as the default; fair_unified_task_shared is opt-in via spark.comet.exec.memoryPool
  • Update tuning docs to document all three off-heap pool types

How are these changes tested?

Tested with TPC-H queries using tracing to compare jemalloc and pool-tracked memory between fair_unified and fair_unified_task_shared. Confirmed that the tracing fix eliminates the false 2x reporting and that both pool types show equivalent actual memory usage.

andygrove and others added 4 commits April 10, 2026 10:24
…ation

When Comet executes a shuffle, it creates two separate native plans (the
child plan and the shuffle writer plan) that run concurrently in a
pipelined fashion. Previously, each plan got its own memory pool at the
full per-task limit, effectively allowing 2x the intended memory to be
consumed.

The new `fair_unified_task_shared` pool type shares a single
CometFairMemoryPool across all native plans within the same Spark task.
This ensures the total memory stays within the per-task limit while
dynamically distributing memory among operators based on how many
register as memory consumers (e.g. if the child plan is a simple
scan+filter, the shuffle writer gets 100% of the pool).

This is now the default for off-heap mode.

Closes apache#3921

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When using fair_unified_task_shared, multiple execution contexts on the
same thread share a single Arc<dyn MemoryPool>. The tracing code was
summing pool.reserved() for each registered context, double-counting
the shared pool and reporting 2x the actual memory reservation.

Deduplicate pools by Arc data pointer before summing so each underlying
pool is only counted once.
Make fair_unified_task_shared opt-in rather than the default to
simplify review. Update docs to reflect the new default.
@andygrove andygrove changed the title feat: add fair_unified_task_shared memory pool to fix 2x memory allocation [experimental] feat: add fair_unified_task_shared memory pool to fix 2x memory allocation Apr 11, 2026
Add context about how Comet creates two concurrent native plans per
Spark task during shuffle and why this matters for pool selection.
@andygrove andygrove marked this pull request as ready for review April 11, 2026 16:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Comet allocates 2x available memory in memory pools for executePlan

1 participant