@noluyorAbi noluyorAbi commented Jan 18, 2026

Problem

This PR addresses the "No valid embeddings found in any documents" error reported in issue #252 and related issues (#198, #266, #334).

The root cause is that the embedder configuration in api/config/embedder.json was missing the initialize_kwargs section, which is required to pass API credentials to the embedder client. Without this, the embedder cannot authenticate with OpenAI-compatible APIs or Google APIs, resulting in failed embedding requests.

Additionally, the dimensions field in model_kwargs causes errors with embedding models that do not support matryoshka representation (e.g., bge, Qwen).

Changes

  1. Added initialize_kwargs to the OpenAI embedder configuration

    • Includes api_key and base_url placeholders that are substituted from environment variables
    • Enables authentication with OpenAI-compatible embedding APIs
  2. Removed the dimensions field from model_kwargs

    • Prevents errors with models that do not support matryoshka representation
  3. Added initialize_kwargs to the Google embedder configuration

    • Includes api_key placeholder for Google API authentication

Configuration

Users need to set the following environment variables:

  • OPENAI_API_KEY - API key for OpenAI-compatible embedding models
  • OPENAI_BASE_URL - Base URL for OpenAI-compatible API endpoints
  • GOOGLE_API_KEY - (Optional) API key for Google embedding models

The ${ENV_VAR} placeholders in the configuration are automatically replaced with environment variable values at runtime.
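
As an illustration of the substitution described above, here is a minimal sketch. The helper name `substitute_env` and its `env` parameter are hypothetical and not taken from the project, and leaving an unset variable as the literal placeholder text is an assumption of this sketch rather than a confirmed detail of the real implementation.

```python
import os
import re

# Matches ${NAME}, where NAME looks like an environment variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value, env=None):
    """Recursively replace ${NAME} in strings nested inside dicts and lists.

    Hypothetical helper for illustration only; not the project's function.
    """
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {k: substitute_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [substitute_env(v, env) for v in value]
    if isinstance(value, str):
        # Assumption: an unset variable leaves the placeholder text untouched.
        return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    return value
```

With only `OPENAI_API_KEY` defined, the `api_key` entry is resolved while `base_url` would keep its `${OPENAI_BASE_URL}` text.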

Related Issues



@gemini-code-assist

Summary of Changes

Hello @noluyorAbi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical issues related to embedding generation by updating the embedder.json configuration. It introduces necessary authentication parameters for both OpenAI-compatible and Google embedding services, ensuring that API keys and base URLs are correctly utilized. Additionally, it addresses model compatibility by removing a problematic dimensions field, preventing errors with models that do not support specific representations. These changes collectively aim to fix 'No valid embeddings found' errors and improve the robustness of the embedding system.

Highlights

  • OpenAI Embedder Configuration Update: Added initialize_kwargs to the OpenAI embedder configuration to allow passing api_key and base_url from environment variables, enabling proper authentication for OpenAI-compatible embedding APIs.
  • Model Compatibility Fix: Removed the dimensions field from model_kwargs in the OpenAI embedder configuration to prevent errors with embedding models that do not support matryoshka representation.
  • Google Embedder Configuration Update: Added initialize_kwargs to the Google embedder configuration to allow passing api_key from environment variables, enabling proper authentication for Google embedding APIs.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an authentication issue with embedders by adding initialize_kwargs to the configuration. While the intent is correct and the removal of the dimensions parameter is a good improvement, the current implementation of initialize_kwargs introduces a critical issue. If environment variables are not set, their placeholder strings are passed as literal credentials, which bypasses the clients' more robust error handling and default value logic. This can lead to non-obvious runtime failures. My review comments detail this problem and recommend removing the new initialize_kwargs sections, as the client classes are already designed to handle credential loading from the environment.

Comment on lines +4 to +7
"initialize_kwargs": {
  "api_key": "${OPENAI_API_KEY}",
  "base_url": "${OPENAI_BASE_URL}"
},

critical

This change introduces a potential issue when environment variables are not set. The replace_env_placeholders function will pass the literal placeholder string (e.g., "${OPENAI_BASE_URL}") to the OpenAIClient constructor if the corresponding environment variable is missing.

This causes two problems:

  1. For base_url, it bypasses the client's logic to use the default OpenAI URL (https://api.openai.com/v1) when OPENAI_BASE_URL is not set. The client will instead try to connect to the invalid URL "${OPENAI_BASE_URL}".
  2. For api_key, it bypasses the client's validation that raises a ValueError if the API key is missing. Instead, it will attempt to authenticate with the invalid key "${OPENAI_API_KEY}".

The OpenAIClient is already designed to read api_key and base_url from environment variables if they are not passed to the constructor. Relying on that existing mechanism is more robust.

While the removal of the dimensions field is a good improvement, I recommend removing the initialize_kwargs section entirely. The client will correctly handle credentials on its own. If the original issue persists, it might be due to the environment variables not being correctly propagated to the application, which should be investigated.
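
To make the two problems above concrete, here is a minimal sketch of a client that loads its own credentials. `SketchClient` and its `env` parameter are hypothetical stand-ins, not the project's OpenAIClient; only the default URL and the ValueError behavior are taken from the description above.

```python
import os

DEFAULT_BASE_URL = "https://api.openai.com/v1"


class SketchClient:
    """Hypothetical stand-in for a client that resolves its own credentials."""

    def __init__(self, api_key=None, base_url=None, env=None):
        env = os.environ if env is None else env
        # Explicit arguments win; otherwise fall back to the environment.
        self.api_key = api_key or env.get("OPENAI_API_KEY")
        self.base_url = base_url or env.get("OPENAI_BASE_URL", DEFAULT_BASE_URL)
        if not self.api_key:
            # Only reachable when no key was passed and none is in the environment.
            raise ValueError("OPENAI_API_KEY is not set")
```

In this sketch, passing the literal strings "${OPENAI_API_KEY}" and "${OPENAI_BASE_URL}" as arguments skips both the default URL and the ValueError, because any non-empty string counts as an explicit value.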

Contributor

The OpenAIClient is already designed to read api_key and base_url from environment variables if they are not passed to the constructor. Relying on that existing mechanism is more robust.

I am curious: what issues are you seeing when it is not defined directly in the kwargs?

Comment on lines +22 to +24
"initialize_kwargs": {
  "api_key": "${GOOGLE_API_KEY}"
},

critical

Similar to the OpenAI configuration, this change can cause issues if the GOOGLE_API_KEY environment variable is not set. The GoogleEmbedderClient will be initialized with api_key as the literal string "${GOOGLE_API_KEY}".

This bypasses the client's logic to raise a ValueError for a missing key, and will instead cause a failure later when genai.configure() is called with an invalid key. The PR description also mentions this key is optional, which makes this behavior particularly problematic.

The GoogleEmbedderClient already handles reading the API key from the environment. It's better to rely on the client's implementation.

I recommend removing this initialize_kwargs section to allow the client to manage its own credential loading.
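
For anyone who wants to confirm whether a literal placeholder is reaching the client, a small check like the following may help. It is only a sketch: `unresolved_placeholders` is a hypothetical helper, not part of the project or of any client library.

```python
import re

# Matches any leftover ${NAME} text inside a string value.
_UNRESOLVED = re.compile(r"\$\{[A-Za-z_][A-Za-z0-9_]*\}")


def unresolved_placeholders(kwargs):
    """Return the sorted keys whose string values still contain ${NAME}."""
    return sorted(
        key
        for key, value in kwargs.items()
        if isinstance(value, str) and _UNRESOLVED.search(value)
    )
```

Calling it on the initialize_kwargs dictionary just before the client is constructed would list `api_key` whenever GOOGLE_API_KEY was never substituted.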

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 potential issue.

Comment on lines 9 to 12
"model_kwargs": {
  "model": "text-embedding-3-small",
  "dimensions": 256,
  "encoding_format": "float"
}

🔴 Removal of dimensions parameter causes embedding dimension mismatch with cached databases

Removing the dimensions: 256 parameter from the OpenAI embedder configuration causes a breaking change for users with existing cached databases.

Background

The text-embedding-3-small model defaults to 1536 dimensions when no dimensions parameter is specified. The previous configuration explicitly set dimensions: 256 to reduce embedding size.

How the bug is triggered

  1. User has an existing cached database (.pkl file) at ~/.adalflow/databases/{repo_name}.pkl with 256-dimensional embeddings (created before this change)
  2. User updates to the new configuration (without dimensions parameter)
  3. System loads the cached 256-dimensional embeddings from api/data_pipeline.py:869-892
  4. When querying, the system generates a 1536-dimensional query embedding using the new configuration
  5. FAISS retriever fails because query embedding dimension (1536) doesn't match document embedding dimension (256)

Code flow

  • api/data_pipeline.py:869-892 loads existing databases without checking embedding dimension compatibility:
if self.repo_paths and os.path.exists(self.repo_paths["save_db_file"]):
    logger.info("Loading existing database...")
    self.db = LocalDB.load_state(self.repo_paths["save_db_file"])
    documents = self.db.get_transformed_data(key="split_and_embed")
    if documents:
        # ... logs dimensions but doesn't validate against current config
        return documents  # Returns old embeddings
  • api/rag.py:385-390 creates FAISS retriever with mismatched dimensions:
self.retriever = FAISSRetriever(
    **configs["retriever"],
    embedder=retrieve_embedder,  # Uses new 1536-dim embedder
    documents=self.transformed_docs,  # Contains old 256-dim embeddings
    document_map_func=lambda doc: doc.vector,
)

Impact

  • Runtime errors when querying repositories that have cached databases
  • Error message: "All embeddings should be of the same size" or similar FAISS dimension mismatch error
  • Users must manually delete cached databases to recover
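
The manual recovery step can be sketched as follows, assuming the cache location given above (~/.adalflow/databases/{repo_name}.pkl). The function names and the `root` parameter are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def cached_db_path(repo_name, root=None):
    """Build the cache path <root>/{repo_name}.pkl (root defaults to ~/.adalflow/databases)."""
    root = Path.home() / ".adalflow" / "databases" if root is None else Path(root)
    return root / f"{repo_name}.pkl"


def remove_cached_db(repo_name, root=None):
    """Delete the cached database if present; return True when a file was removed."""
    path = cached_db_path(repo_name, root)
    if path.is_file():
        path.unlink()
        return True
    return False
```

After removal, the next run would have to rebuild the embeddings with whatever dimension the current configuration produces.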

Recommendation: Either: (1) Keep the dimensions: 256 parameter to maintain backward compatibility, or (2) Add dimension validation in api/data_pipeline.py to detect and rebuild databases when embedding dimensions don't match the current configuration.
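
Option (2) could look roughly like the sketch below. The helper names are hypothetical, and how the expected dimension is obtained (from configuration or from a probe embedding) is left as an assumption; this is not the code in api/data_pipeline.py.

```python
def embedding_dims(vectors):
    """Return the set of distinct vector lengths among the cached embeddings."""
    return {len(v) for v in vectors}


def cache_matches_config(vectors, expected_dim):
    """True only when every cached vector has exactly expected_dim entries.

    An empty cache yields False, so the caller would rebuild in that case too.
    """
    return embedding_dims(vectors) == {expected_dim}
```

A loader using this check would return the cached documents only when it passes, and otherwise fall through to re-embedding the repository.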

Development

Successfully merging this pull request may close these issues.

No valid XML found in response
