-
Notifications
You must be signed in to change notification settings - Fork 1.5k
fix: add initialize_kwargs for embedder API authentication #457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: add initialize_kwargs for embedder API authentication #457
Conversation
Summary of ChangesHello @noluyorAbi, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves critical issues related to embedding generation by updating the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request aims to fix an authentication issue with embedders by adding initialize_kwargs to the configuration. While the intent is correct and the removal of the dimensions parameter is a good improvement, the current implementation of initialize_kwargs introduces a critical issue. If environment variables are not set, their placeholder strings are passed as literal credentials, which bypasses the clients' more robust error handling and default value logic. This can lead to non-obvious runtime failures. My review comments detail this problem and recommend removing the new initialize_kwargs sections, as the client classes are already designed to handle credential loading from the environment.
| "initialize_kwargs": { | ||
| "api_key": "${OPENAI_API_KEY}", | ||
| "base_url": "${OPENAI_BASE_URL}" | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change introduces a potential issue when environment variables are not set. The replace_env_placeholders function will pass the literal placeholder string (e.g., "${OPENAI_BASE_URL}") to the OpenAIClient constructor if the corresponding environment variable is missing.
This causes two problems:
- For
base_url, it bypasses the client's logic to use the default OpenAI URL (https://api.openai.com/v1) whenOPENAI_BASE_URLis not set. The client will instead try to connect to the invalid URL"${OPENAI_BASE_URL}". - For
api_key, it bypasses the client's validation that raises aValueErrorif the API key is missing. Instead, it will attempt to authenticate with the invalid key"${OPENAI_API_KEY}".
The OpenAIClient is already designed to read api_key and base_url from environment variables if they are not passed to the constructor. Relying on that existing mechanism is more robust.
While the removal of the dimensions field is a good improvement, I recommend removing the initialize_kwargs section entirely. The client will correctly handle credentials on its own. If the original issue persists, it might be due to the environment variables not being correctly propagated to the application, which should be investigated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OpenAIClient is already designed to read api_key and base_url from environment variables if they are not passed to the constructor. Relying on that existing mechanism is more robust.
I am curious what issues are you seeing if it is not defined directly in the kwargs?
| "initialize_kwargs": { | ||
| "api_key": "${GOOGLE_API_KEY}" | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the OpenAI configuration, this change can cause issues if the GOOGLE_API_KEY environment variable is not set. The GoogleEmbedderClient will be initialized with api_key as the literal string "${GOOGLE_API_KEY}".
This bypasses the client's logic to raise a ValueError for a missing key, and will instead cause a failure later when genai.configure() is called with an invalid key. The PR description also mentions this key is optional, which makes this behavior particularly problematic.
The GoogleEmbedderClient already handles reading the API key from the environment. It's better to rely on the client's implementation.
I recommend removing this initialize_kwargs section to allow the client to manage its own credential loading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "model_kwargs": { | ||
| "model": "text-embedding-3-small", | ||
| "dimensions": 256, | ||
| "encoding_format": "float" | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 Removal of dimensions parameter causes embedding dimension mismatch with cached databases
Removing the dimensions: 256 parameter from the OpenAI embedder configuration causes a breaking change for users with existing cached databases.
Click to expand
Background
The text-embedding-3-small model defaults to 1536 dimensions when no dimensions parameter is specified. The previous configuration explicitly set dimensions: 256 to reduce embedding size.
How the bug is triggered
- User has an existing cached database (
.pklfile) at~/.adalflow/databases/{repo_name}.pklwith 256-dimensional embeddings (created before this change) - User updates to the new configuration (without
dimensionsparameter) - System loads the cached 256-dimensional embeddings from
api/data_pipeline.py:869-892 - When querying, the system generates a 1536-dimensional query embedding using the new configuration
- FAISS retriever fails because query embedding dimension (1536) doesn't match document embedding dimension (256)
Code flow
api/data_pipeline.py:869-892loads existing databases without checking embedding dimension compatibility:
if self.repo_paths and os.path.exists(self.repo_paths["save_db_file"]):
logger.info("Loading existing database...")
self.db = LocalDB.load_state(self.repo_paths["save_db_file"])
documents = self.db.get_transformed_data(key="split_and_embed")
if documents:
# ... logs dimensions but doesn't validate against current config
return documents # Returns old embeddingsapi/rag.py:385-390creates FAISS retriever with mismatched dimensions:
self.retriever = FAISSRetriever(
**configs["retriever"],
embedder=retrieve_embedder, # Uses new 1536-dim embedder
documents=self.transformed_docs, # Contains old 256-dim embeddings
document_map_func=lambda doc: doc.vector,
)Impact
- Runtime errors when querying repositories that have cached databases
- Error message: "All embeddings should be of the same size" or similar FAISS dimension mismatch error
- Users must manually delete cached databases to recover
Recommendation: Either: (1) Keep the dimensions: 256 parameter to maintain backward compatibility, or (2) Add dimension validation in api/data_pipeline.py to detect and rebuild databases when embedding dimensions don't match the current configuration.
Was this helpful? React with 👍 or 👎 to provide feedback.
Problem
This PR addresses the "No valid embeddings found in any documents" error reported in issue #252 and related issues (#198, #266, #334).
The root cause is that the embedder configuration in
api/config/embedder.jsonwas missing theinitialize_kwargssection, which is required to pass API credentials to the embedder client. Without this, the embedder cannot authenticate with OpenAI-compatible APIs or Google APIs, resulting in failed embedding requests.Additionally, the
dimensionsfield inmodel_kwargscauses errors with embedding models that do not support matryoshka representation (e.g., bge, Qwen).Changes
Added
initialize_kwargsto the OpenAI embedder configurationapi_keyandbase_urlplaceholders that are substituted from environment variablesRemoved the
dimensionsfield frommodel_kwargsAdded
initialize_kwargsto the Google embedder configurationapi_keyplaceholder for Google API authenticationConfiguration
Users need to set the following environment variables:
OPENAI_API_KEY- API key for OpenAI-compatible embedding modelsOPENAI_BASE_URL- Base URL for OpenAI-compatible API endpointsGOOGLE_API_KEY- (Optional) API key for Google embedding modelsThe
${ENV_VAR}placeholders in the configuration are automatically replaced with environment variable values at runtime.Related Issues