[BUG] input_files (PDFFile) are passed as base64 via read_file tool, causing context overflow and inconsistent LLM behavior #5930

@axelgbl

Description

When using input_files with PDFFile (or File), CrewAI does not appear to handle the file as a native file input at the provider level.

Instead, the file is processed via the read_file tool and its content is returned as a binary/base64 representation. This content is then indirectly injected into the agent execution context.

As a result:

  • The PDF is effectively treated as inline binary data (base64)
  • The LLM context becomes extremely large
  • Responses become inconsistent or fail due to context overflow
  • The same file is re-processed during agent execution via tools

This makes PDFFile unreliable for large or even medium-sized documents.
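
To give a sense of scale, the sketch below computes how many characters a file occupies once it is base64-encoded. Base64 maps every 3 bytes to 4 characters, so the encoded form is roughly a third larger than the file itself. The helper name `base64_length` and the 2 MB example size are illustrative assumptions, not part of CrewAI.

```python
import base64


def base64_length(num_bytes: int) -> int:
    """Return the length of the padded base64 encoding of num_bytes bytes."""
    # Every group of 3 input bytes becomes 4 output characters;
    # a trailing partial group is padded to a full 4 characters.
    return 4 * ((num_bytes + 2) // 3)


if __name__ == "__main__":
    payload = bytes(2_000_000)  # stand-in for a 2 MB PDF (hypothetical size)
    encoded = base64.b64encode(payload)
    # The closed-form estimate matches the real encoder.
    assert len(encoded) == base64_length(len(payload))
    print(f"{len(payload):,} bytes -> {len(encoded):,} base64 characters")
```

All of those characters end up in the prompt if the tool output is passed to the model verbatim, which is consistent with the overflow described above.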

Steps to Reproduce

  1. Create a minimal CrewAI setup:
from crewai import Agent, Task, Crew
from crewai_files import PDFFile

agent = Agent(
    role="Document Analyst",
    goal="Extract structured information from PDFs",
    backstory="Expert in document analysis",
    llm="gpt-4o-mini",
)

task = Task(
    description="""
    Read the PDF document {doc}
    Extract the main sections and summarize them precisely.
    """,
    expected_output="Structured list of sections",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task], verbose=True)

result = crew.kickoff(
    input_files={
        "doc": PDFFile(source="./src/test_crewai_files/pdfs/sample.pdf")
    }
)

print(result)
  2. Run the flow:
    crewai run
  3. Observe the agent execution logs with verbose=True

Expected behavior

The PDF should be:

  • either streamed or parsed externally before being sent to the LLM
  • or converted into structured text chunks (not raw base64)

The agent should receive:

  • structured text
  • or extracted segments
  • NOT a raw base64 PDF representation
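
One way to picture this expected behavior: text is extracted from the PDF outside the LLM call and handed to the agent in bounded chunks. The sketch below covers only the chunking step on already-extracted text. The extraction itself (for example with a PDF parsing library) is assumed to happen beforehand, and `chunk_text` and its `max_chars` parameter are hypothetical names, not a CrewAI API.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split extracted text into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together where possible;
    a paragraph longer than max_chars is split at hard character boundaries.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Hard-split paragraphs that cannot fit into a single chunk.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ]
        for piece in pieces:
            candidate = f"{current}\n\n{piece}" if current else piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                # The current chunk is full: store it and start a new one.
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

With chunks like these, the agent's context is bounded by `max_chars` per chunk rather than by the encoded size of the whole file.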

Operating System

Ubuntu 24.04

Python Version

3.12

crewAI Version

v1.14.4

crewAI Tools Version

v1.14.4

Virtual Environment

Venv

Evidence

Verbose output evidence

Tool Execution Started (#3)

Tool: read_file
Args: {'file_name': 'doc'}

Tool Execution Completed (#3)

Tool Completed
Tool: read_file
Output: [Binary file: sample.pdf (application/pdf)]
Base64:
...

Possible Solution

None

Additional context

This issue blocks any production usage of PDF ingestion in multi-step CrewAI flows, because:

  • context size grows linearly with file size
  • multiple tasks re-trigger file expansion
  • sequential workflows amplify token explosion

A safer architecture would:

  • load the file once
  • extract a structured representation once
  • reuse the extracted representation across tasks without re-injecting the raw binary
