Skip to content

Wrapper gaps: dead duplicate framework runners, divergent async scheduler, unsafe shared singletons in src/praisonai/praisonai #1722

@MervinPraison

Description

@MervinPraison

Scope

In-depth analysis of src/praisonai/praisonai (the wrapper layer). I focused only on key architectural gaps — not docs, tests, coverage, file sizes, or line counts.

After a full pass over the wrapper, three gaps stand out because they actively violate the project's stated philosophy ("simpler · DRY · multi-agent safe · async-safe · no global singletons · protocol-driven"). Each one is reproducible from the current main, has been read end-to-end, and the proposed fix is concrete.


Gap 1 — agents_generator.py carries ~470 lines of dead, duplicate, broken framework-runner methods

Where

src/praisonai/praisonai/agents_generator.py

Five private methods defined on AgentsGenerator:

Method Lines
_run_autogen 673–751
_run_autogen_v4 753–859
_run_ag2 861–1005
_run_crewai 1006–1143
_run_praisonai 1144–EOF

Why this is a problem

  1. They are never called. The execution path is generate_crew_and_kickoffself.framework_adapter.run(...) at line 607:

    # agents_generator.py:602-615
    from .framework_adapters.validators import assert_framework_available
    assert_framework_available(framework)
    self.logger.info(f"Using framework: {framework}")
    return self.framework_adapter.run(
        config,
        self.config_list,
        topic,
        tools_dict=tools_dict,
        agent_callback=getattr(self, 'agent_callback', None),
        task_callback=getattr(self, 'task_callback', None),
        cli_config=getattr(self, 'cli_config', None),
    )

    A repo-wide grep confirms no caller for _run_autogen, _run_autogen_v4, _run_ag2, _run_crewai, or _run_praisonai anywhere in src/. The only matching hit is the unrelated _run_praisonai_agents in jobs/executor.py.

  2. They cannot run even if called — names are not imported in this module:

    # agents_generator.py:688 (inside _run_autogen)
    user_proxy = autogen.UserProxyAgent(   # NameError: 'autogen' is not defined
        name="User",
        ...
    )
    # agents_generator.py:770 (inside _run_autogen_v4)
    model_client = OpenAIChatCompletionClient(   # NameError
        model=model_config.get('model', 'gpt-5-nano'),
        ...
    )
    # 796: AutoGenV4AssistantAgent — undefined
    # 815: TextMentionTermination — undefined
    # 816: MaxMessageTermination  — undefined
    # 820: RoundRobinGroupChat    — undefined
  3. They duplicate the canonical adapter implementations. Compare _run_autogen (lines 673–751) to framework_adapters/autogen_adapter.py:29–100:

    # framework_adapters/autogen_adapter.py:55-89  (the real, working impl)
    import autogen
    user_proxy = autogen.UserProxyAgent(
        name="User",
        human_input_mode="NEVER",
        is_termination_msg=lambda x: ...,
        code_execution_config={"work_dir": "coding", "use_docker": False},
    )
    ...
    agents[role] = autogen.AssistantAgent(
        name=agent_name,
        llm_config=llm_config_dict,
        system_message=...,
    )

    This is character-for-character the same code as agents_generator._run_autogen, except the adapter version actually imports autogen and is wired into the registry. The same pattern holds for v0.4, AG2, CrewAI, and PraisonAI.

This directly contradicts the philosophy: "DRY: reuse abstractions, no duplication" and "Protocol-driven core: protocols/hooks in core; heavy impl in wrapper/tools". Today, anyone fixing a CrewAI bug has to guess which copy is live (the adapter), and the dead copy slowly drifts and rots.

Fix

Delete the five methods. No call sites, no public API impact, no shim needed. Concretely, in src/praisonai/praisonai/agents_generator.py:

# DELETE: lines 673-EOF (the five _run_* methods)
# KEEP: generate_crew_and_kickoff(...) which already delegates to
#       self.framework_adapter.run(...) at line 607

That removes ~470 lines of dead code, eliminates the drift hazard, and lets agents_generator.py shrink to what it actually does: parse the YAML, validate, pick the adapter, hand off.


Gap 2 — async_agent_scheduler.py claims to be a "backward-compatible re-export" but is a full duplicate implementation with silent feature drift

Where

  • src/praisonai/praisonai/async_agent_scheduler.py (424 lines — async)
  • src/praisonai/praisonai/scheduler/agent_scheduler.py (554 lines — sync, canonical)
  • src/praisonai/praisonai/scheduler/shared.py (shared helpers)

Why this is a problem

The header of async_agent_scheduler.py advertises itself as a thin shim:

# async_agent_scheduler.py:1-15
"""Backward-compatible re-export. Prefer `praisonai.scheduler`.

This module is deprecated. Use the canonical implementation in the
scheduler package for full functionality including async support.
"""
import warnings
warnings.warn(
    "praisonai.async_agent_scheduler is pending deprecation; it will be moved to "
    "praisonai.scheduler.async_agent_scheduler in a future release. ...",
    PendingDeprecationWarning, stacklevel=2,
)

# TODO: Once AsyncAgentScheduler is moved to scheduler package, import from there
# For now, re-export the existing implementation to avoid breaking changes
from .scheduler.shared import ScheduleParser, backoff_delay, safe_call

…but everything after that TODO is a full, independent implementation of AsyncAgentExecutorInterface, AsyncPraisonAgentExecutor, and AsyncAgentScheduler (lines 30–425). Meanwhile the sync AgentScheduler lives in scheduler/agent_scheduler.py. The two share ScheduleParser / backoff_delay, and nothing else.

That has already produced silent feature drift. The async scheduler is missing features the sync one ships today:

Feature (sync, scheduler/agent_scheduler.py) Async (async_agent_scheduler.py)
timeout per-execution kill switch (lines 44, 244–259) ❌ absent
max_cost budget guard (lines 45, 216–220) ❌ absent
Cost tracking (self._total_cost, lines 269–272) ❌ absent
_update_state_if_daemon() (lines 175–210) — daemon state file updates ❌ absent
from_yaml(...) constructor (lines 337–419) ❌ absent
from_recipe(...) constructor (lines 441–535) ❌ absent
start_from_yaml_config() (lines 421–438) ❌ absent

So praisonai.AgentScheduler and praisonai.AsyncAgentScheduler look symmetric in the public surface, but the async path silently drops budget protection, recipe support, YAML support, daemon state, and timeouts. The retry/execute loops are also separately written:

# sync — scheduler/agent_scheduler.py:233-309
def _execute_with_retry(self, max_retries: int):
    self._execution_count += 1
    ...
    for attempt in range(max_retries):
        try:
            if self.timeout:
                signal.alarm(self.timeout)
                result = self._executor.execute(self.task)
                ...
            self._total_cost += estimated_cost
            self._success_count += 1
            ...
        except Exception as e:
            ...
            wait_time = backoff_delay(attempt, initial=30.0, cap=300.0)
            if self._stop_event.wait(wait_time):
                return
# async — async_agent_scheduler.py:353-390
async def _execute_with_retry(self, max_retries: int):
    self._ensure_async_primitives()
    async with self._stats_lock:
        self._execution_count += 1
    ...
    for attempt in range(max_retries):
        try:
            result = await self._executor.execute(self.task)
            async with self._stats_lock:
                self._success_count += 1
            safe_call(self.on_success, result)
            return
        except Exception as e:
            last_exc = e
            if attempt < max_retries - 1:
                wait_time = backoff_delay(attempt)
                await asyncio.sleep(wait_time)

Two implementations of the same retry/stats lifecycle, differing only in await vs blocking calls. This is exactly what the philosophy calls out: "Multi-agent + async safe by default" and "DRY: reuse abstractions" — the wrapper should have one scheduler that exposes both sync and async entry points over a shared core.

Fix

  1. Move the file to its canonical home and complete the in-flight migration the TODO calls out:

    • Create src/praisonai/praisonai/scheduler/async_agent_scheduler.py containing the class.
    • Reduce the root async_agent_scheduler.py to a thin re-export, matching the existing pattern in agent_scheduler.py:1-18:
      # src/praisonai/praisonai/async_agent_scheduler.py  (after fix)
      """Backward-compatible re-export. Prefer `praisonai.scheduler`."""
      import warnings
      warnings.warn(
          "praisonai.async_agent_scheduler is deprecated; "
          "use 'from praisonai.scheduler import AsyncAgentScheduler' instead.",
          DeprecationWarning, stacklevel=2,
      )
      from praisonai.scheduler.async_agent_scheduler import (  # noqa: F401
          AsyncAgentScheduler, AsyncPraisonAgentExecutor, create_async_agent_scheduler,
      )
  2. Factor the execute/retry/stats lifecycle into a shared mixin in scheduler/shared.py so the two classes diverge only on await boundaries. Sketch:

    # scheduler/shared.py  (new)
    class _ScheduleCore:
        """Shared lifecycle: stats, budget, timeout, retry policy."""
        def _begin_attempt(self): self._execution_count += 1
        def _on_success(self, result):
            self._success_count += 1
            self._total_cost += self._estimate_cost(result)
            safe_call(self.on_success, result)
        def _on_failure(self, exc):
            self._failure_count += 1
            safe_call(self.on_failure, exc)
        def _budget_exhausted(self) -> bool:
            return bool(self.max_cost and self._total_cost >= self.max_cost)

    Then AgentScheduler provides sync _execute_with_retry and AsyncAgentScheduler provides async _execute_with_retry, both calling into _ScheduleCore. Port timeout, max_cost, from_yaml, from_recipe, and _update_state_if_daemon into the shared layer so the async path picks them up for free.


Gap 3 — Process-level mutable singletons in interactive_tools.py and tool_resolver.py break multi-agent and multi-project use

The philosophy is explicit: "no global singletons, no heavy module-level work" and "multi-agent safe by default". Two concrete violations in the wrapper today.

3a. _shared_runtime is global, unlocked, and shared across all concurrent agents

src/praisonai/praisonai/cli/features/interactive_tools.py:196-253

# ── Shared runtime singleton ────────────────────────────────────────────────
# ACP and LSP tools share ONE InteractiveRuntime to avoid duplicate LSP servers.
_shared_runtime = None
_shared_agent_tools = None


def _get_shared_runtime(config: ToolConfig):
    """Get or create a shared InteractiveRuntime instance."""
    global _shared_runtime, _shared_agent_tools

    if _shared_runtime is not None and _shared_agent_tools is not None:
        return _shared_runtime, _shared_agent_tools

    from .interactive_runtime import InteractiveRuntime, RuntimeConfig
    from .agent_tools import create_agent_centric_tools

    runtime_config = RuntimeConfig(
        workspace=config.workspace,
        lsp_enabled=config.lsp_enabled,
        acp_enabled=config.acp_enabled,
        approval_mode=config.approval_mode,
    )

    _shared_runtime = InteractiveRuntime(runtime_config)
    _shared_agent_tools = create_agent_centric_tools(_shared_runtime)
    ...

Two real problems:

  1. No lock. Two agents instantiating tools concurrently both pass the is not None check, both call InteractiveRuntime(...), the second overwrites the first — leaking an LSP server process and an ACP session that will never be stopped because cleanup_runtime only sees the latest reference.

  2. Config is captured on first call and silently ignored after. Agent A calls with workspace="/foo", approval_mode="auto". Agent B later calls with workspace="/bar", approval_mode="manual" — B gets back A's runtime, bound to /foo and auto, with no warning. The config argument becomes a lie for everyone after the first caller.

3b. _default_resolver is anchored to the CWD at first call and never invalidated

src/praisonai/praisonai/tool_resolver.py:458-477

# Process-level lazy singleton for performance (matches profiler.py pattern)
_default_resolver: Optional[ToolResolver] = None
_default_resolver_lock = threading.Lock()

def _get_default_resolver() -> ToolResolver:
    """Process-default ToolResolver (double-checked lazy init).

    Returns cached ToolResolver that is anchored to the working directory
    at first call. Local tools.py resolution is CWD-dependent and cached
    for the lifetime of the process.

    For test isolation or multi-project CLIs, create explicit resolver
    instances instead of using this cached default.
    """
    global _default_resolver
    if _default_resolver is None:
        with _default_resolver_lock:
            if _default_resolver is None:
                _default_resolver = ToolResolver()
    return _default_resolver

ToolResolver.__init__ defaults tools_py_path to the literal string "tools.py" (resolver line 62) and caches whatever it loaded on first access (lines 78–114). In any long-lived process — daemon, IDE plugin, MCP server, Jupyter session, the SDK as a library — the first resolve_tool("foo") call locks in that CWD's tools.py. Switch projects, change directory, run a second praisonai command in the same Python process, and you still get the original project's tools. The docstring acknowledges the hazard but recommends "create explicit resolver instances" — which nothing else in the wrapper does, so all internal call sites silently inherit the bug.

Fix

For both: replace process-level globals with contextvars.ContextVar so each agent / task / request gets its own value, and the singleton becomes the explicit fallback only at the CLI edge.

interactive_tools.py:

# Replace lines 196-253 with:
import contextvars
import threading

_runtime_var: contextvars.ContextVar = contextvars.ContextVar("interactive_runtime", default=None)
_runtime_lock = threading.Lock()  # only guards the first creation per context

def _get_shared_runtime(config: ToolConfig):
    bundle = _runtime_var.get()
    if bundle is not None:
        return bundle
    with _runtime_lock:
        bundle = _runtime_var.get()
        if bundle is None:
            from .interactive_runtime import InteractiveRuntime, RuntimeConfig
            from .agent_tools import create_agent_centric_tools
            runtime = InteractiveRuntime(RuntimeConfig(
                workspace=config.workspace,
                lsp_enabled=config.lsp_enabled,
                acp_enabled=config.acp_enabled,
                approval_mode=config.approval_mode,
            ))
            bundle = (runtime, create_agent_centric_tools(runtime))
            _runtime_var.set(bundle)
    return bundle

def cleanup_runtime():
    bundle = _runtime_var.get()
    if bundle is None:
        return
    runtime, _ = bundle
    # ... existing shutdown logic, then:
    _runtime_var.set(None)

Concurrent agents now each get their own runtime when running under their own contextvars.copy_context(); the config argument is honoured on every first-in-context call instead of being silently dropped.

tool_resolver.py:

# Replace lines 458-477 with:
import contextvars

_resolver_var: contextvars.ContextVar[Optional[ToolResolver]] = contextvars.ContextVar(
    "tool_resolver", default=None,
)

def _get_default_resolver() -> ToolResolver:
    """Per-context resolver. Falls back to a fresh ToolResolver per context,
    so changing CWD between agents / requests is honoured."""
    resolver = _resolver_var.get()
    if resolver is None:
        resolver = ToolResolver()
        _resolver_var.set(resolver)
    return resolver

def reset_default_resolver() -> None:
    """Explicit invalidation hook for daemons / IDE plugins switching projects."""
    _resolver_var.set(None)

This keeps the fast path (cache once per context) but stops binding the entire process to whatever CWD the first caller happened to be in. Callers that already pass an explicit resolver= argument (the resolve_tool / resolve_tools / has_tool / validate_yaml_tools family at lines 481–547) are unaffected.


Why these three

  • Gap 1 is dead — fixing it is pure removal, zero risk, and immediately stops new code from being added to the wrong copy.
  • Gap 2 is drift-in-progress — the sync scheduler is gaining features (max_cost, from_recipe, daemon state) the async one quietly lacks, so the longer it sits, the more the two diverge.
  • Gap 3 is latent-but-load-bearing — the singletons work fine for one-shot CLI runs (which is most of the test surface) but quietly break the exact deployment shape the project is moving toward: long-lived agents, multi-project daemons, MCP/UI servers running many concurrent sessions.

All three are scoped to src/praisonai/praisonai. Happy to send PRs for any/all of these if useful.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingclaudeAuto-trigger Claude analysisdocumentationImprovements or additions to documentationperformance

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions