Skip to content

Parallelize flow compilation on submission path#4175

Open
DaisyModi wants to merge 3 commits intoapache:masterfrom
DaisyModi:dmodi/parallelize-flow-compilation-on-submission
Open

Parallelize flow compilation on submission path#4175
DaisyModi wants to merge 3 commits intoapache:masterfrom
DaisyModi:dmodi/parallelize-flow-compilation-on-submission

Conversation

@DaisyModi
Copy link

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • My PR addresses the following Gobblin JIRA issues and references them in the PR title.

Description

  • Here are some details about my PR:

Every flow gets compiled twice — once on submission and once on execution. The execution path already runs on a thread pool of size 3 (DagProcessingEngine), proving that compileFlow() is thread-safe. However, the submission path was fully serialized by:

  1. A synchronized block on SpecCatalogListenersList.onAddSpec()
  2. A single-thread executor inside CallbacksDispatcher

This caused unnecessary queuing when multiple flows were submitted simultaneously.

Changes:

  • Removed synchronized from SpecCatalogListenersList.onAddSpec() to allow concurrent flow compilations
  • Added a new constructor to SpecCatalogListenersList that accepts an ExecutorService
  • Updated FlowCatalog to create a configurable FixedThreadPool (default: 3 threads) via gobblin.service.specCatalogListener.numThreads

Safety:

  • onDeleteSpec/onUpdateSpec remain synchronized (topology changes are infrequent)
  • addListener/removeListener remain synchronized
  • topologySpecMap is a ConcurrentMap — safe for concurrent reads
  • The execution path (DagProcessingEngine with 3 threads) already validates compileFlow() thread-safety

Tests

  • My PR does not need new testing because it only removes serialization that was already proven unnecessary by the execution path (DagProcessingEngine) which runs compileFlow() from 3 concurrent threads. Existing FlowCatalogTest and MultiHopFlowCompilerTest cover the compilation flow.

Commits

  • My commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Daisy Modi and others added 3 commits March 16, 2026 14:32
Flow compilation happens twice: on submission and on execution. The execution
path already runs on a thread pool of size 3 (DagProcessingEngine), but the
submission path was serialized by a synchronized block in
SpecCatalogListenersList.onAddSpec() and a single-thread executor in
CallbacksDispatcher.

This change removes the synchronized keyword from onAddSpec() and replaces
the single-thread executor with a configurable thread pool (default 3 threads),
allowing multiple flows to compile in parallel during submission. Compilation
is still serialized per flow but multiple flows now compile concurrently.

The thread pool size is configurable via:
  gobblin.service.specCatalogListener.numThreads (default: 3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Executors.newFixedThreadPool (unbounded queue) with a bounded
ThreadPoolExecutor to prevent memory growth under heavy submission load.
Queue capacity is set to numThreads * 10 with CallerRunsPolicy so that
when the queue is full, the submitting thread runs the compilation
itself, providing natural backpressure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add inline comment in FlowCatalog explaining why the submission path
is now parallelized and Javadoc on the new SpecCatalogListenersList
constructor documenting the design decision and thread-safety reasoning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant