An AI-powered pipeline that takes a plain-English research question, discovers relevant NCBI GEO datasets, downloads and preprocesses them, identifies top expressed genes, and produces a Markdown report with Gemini-generated biological interpretation.
- Intelligent Planning: Gemini decomposes your prompt into a logical task sequence.
- Smart Discovery: Uses Gemini to refine natural language questions into precise NCBI GEO search queries.
- Automated Workflow:
  - Fetches and decompresses GEO series matrix files.
  - Cleans and normalizes expression data (log2 transformation).
  - Ranks genes by expression levels across samples.
- Biological Interpretation: Gemini explains the scientific significance of the top genes found.
- Structured Reporting: Generates comprehensive Markdown reports in the reports/ directory.
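
The automated workflow above (decompress, parse, log2-transform, rank) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names `parse_series_matrix`, `log2_transform`, and `rank_genes` are hypothetical, the pseudocount of 1 and the clamping of negative values to zero are assumptions, and the toy data is made up. It relies only on the series matrix convention that the expression table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers.

```python
import gzip
import math


def parse_series_matrix(text):
    """Return {probe_id: [float, ...]} from the table section of a series matrix file."""
    expression = {}
    in_table = False
    header_seen = False
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
            continue
        if line.startswith("!series_matrix_table_end"):
            break
        if not in_table or not line.strip():
            continue
        fields = [field.strip('"') for field in line.split("\t")]
        if not header_seen:
            header_seen = True  # first table row is the ID_REF / GSM header
            continue
        if len(fields) < 2:
            continue
        try:
            values = [float(value) for value in fields[1:]]
        except ValueError:
            continue  # assumption: drop rows with missing or non-numeric values
        expression[fields[0]] = values
    return expression


def log2_transform(expression, pseudocount=1.0):
    """Apply log2(x + pseudocount); negative inputs are clamped to 0 (an assumption)."""
    return {
        gene: [math.log2(max(value, 0.0) + pseudocount) for value in values]
        for gene, values in expression.items()
    }


def rank_genes(expression, top_n):
    """Rank genes by mean expression across samples, highest first."""
    means = {gene: sum(v) / len(v) for gene, v in expression.items() if v}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Toy series matrix content, compressed in memory to mimic a downloaded .gz file.
sample = (
    '!Series_title\t"toy example"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"probe_a"\t3\t7\n'
    '"probe_b"\t15\t15\n'
    '"probe_c"\t0\t1\n'
    "!series_matrix_table_end\n"
)
compressed = gzip.compress(sample.encode("utf-8"))
text = gzip.decompress(compressed).decode("utf-8")

top = rank_genes(log2_transform(parse_series_matrix(text)), top_n=2)
print(top)  # [('probe_b', 4.0), ('probe_a', 2.5)]
```

Ranking by mean is only one reasonable choice; median or variance-based ranking would slot into `rank_genes` the same way.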
- Clone the repository:
    git clone https://github.com/yourusername/autonomous-bioinformatics-agent.git
    cd autonomous-bioinformatics-agent
- Set up a virtual environment:
    python3 -m venv .venv
    source .venv/bin/activate
- Install dependencies:
    pip install -r requirements.txt
- Set your Gemini API Key: The agent requires a Google Gemini API key. You can get one from Google AI Studio.
    export GEMINI_API_KEY="your_api_key_here"
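
Since the agent depends on `GEMINI_API_KEY` being exported, it can help to fail early with a clear message when the variable is missing. The helper below is a hypothetical sketch of such a check (the name `load_gemini_api_key` is not taken from the project), using only `os.environ`.

```python
import os


def load_gemini_api_key(env=os.environ):
    """Return the Gemini API key from the environment, or raise a clear error."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the agent."
        )
    return key


# Demonstration with an explicit mapping, so it runs without a real key.
print(load_gemini_api_key({"GEMINI_API_KEY": "your_api_key_here"}))
```

Passing the mapping as a parameter keeps the check easy to exercise in tests without touching the real environment.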
You can run the agent by passing your research question as a command-line argument:
    python main.py "Which genes are most highly expressed in breast cancer GSE2034?"
Or run it interactively by calling the script with no arguments:
    python main.py
.
├── agents/ # AI Agents for planning, discovery, and interpretation
├── tools/ # Core bioinformatics tools for analysis and reporting
├── data/ # Local storage for raw and processed datasets (ignored by git)
├── reports/ # Generated Markdown reports (ignored by git)
├── tests/ # Pytest suite for code verification
├── config.py # Global settings and model configuration
├── main.py # Main entry point
└── requirements.txt # Project dependencies
Defaults can be adjusted in config.py:
- GEMINI_MODEL: The model version used (defaults to models/gemini-flash-latest).
- TOP_GENE_COUNT: Number of top-ranked genes to include in reports.
- MAX_DOWNLOAD_RETRIES: Number of attempts for fetching remote datasets.
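
As an illustration of how these settings might be laid out and consumed, here is a small sketch. Only the `GEMINI_MODEL` default comes from this README; the values for `TOP_GENE_COUNT` and `MAX_DOWNLOAD_RETRIES` are placeholders, and `fetch_with_retries` is a hypothetical helper showing one way a retry limit could be applied to a download function.

```python
import time

GEMINI_MODEL = "models/gemini-flash-latest"  # default named in this README
TOP_GENE_COUNT = 20        # placeholder value, not taken from the project
MAX_DOWNLOAD_RETRIES = 3   # placeholder value, not taken from the project


def fetch_with_retries(fetch, max_retries=MAX_DOWNLOAD_RETRIES, delay_seconds=0.0):
    """Call fetch() up to max_retries times, retrying on OSError (network errors)."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except OSError as error:
            last_error = error
            if attempt < max_retries:
                time.sleep(delay_seconds)  # fixed pause between attempts
    raise RuntimeError(f"download failed after {max_retries} attempts") from last_error


# Simulated download that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return b"series matrix bytes"


result = fetch_with_retries(flaky_fetch)
print(result, calls["count"])
```

A fixed delay keeps the sketch short; exponential backoff would be a natural refinement for real downloads.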
Run the test suite to ensure everything is configured correctly:
    pytest tests/
This project is open-source. See the LICENSE for details (if available).