Skip to content

Conversation

@0rajnishk
Copy link

@0rajnishk 0rajnishk commented Jan 11, 2026

Relates to: #26

Summary:
Adds a TTFR estimator module that computes a transparent time range for reaching a first useful output from a dataset (access to basic visualization or analysis). This PR provides the foundation: core logic, tests, and a demo, with proposed integration steps.

Scope of changes:

  • backend/ttfr_estimator.py: Core estimation logic (access, preprocessing, first output)
  • backend/test_ttfr_estimator.py: Unit tests (20 tests, all passing)
  • backend/demo_ttfr.py: Demo script to preview estimates
  • Local docs (gitignored): references/ and instructions/ for planning and integration notes

How it works:

  • Uses datasource defaults (access type, format, typical modality)
  • Infers complexity and multimodality from metadata/content
  • Assesses documentation quality from simple signals
  • Produces a total range with a three-phase breakdown and explicit assumptions

Testing:

cd backend
pytest test_ttfr_estimator.py -v
python demo_ttfr.py

Integration plan (follow-up PR):

  • Add TTFR to RetrievedItem in retrieval.py
  • Compute TTFR per result and include in API responses
  • Format TTFR in agents.py output
  • Optional: display in the frontend UI

Rationale:

  • Addresses user need for dataset feasibility and planning
  • Transparent and explainable estimates; safe fallbacks
  • Modular, low risk, and fully test-covered

Checklist:

  • New module and tests
  • No breaking changes
  • No new mandatory configuration
  • Demo for maintainers

Request for review:

  • Placement of TTFR in responses and desired level of breakdown
  • Any datasource-specific overrides you would like to add

Copilot AI review requested due to automatic review settings January 11, 2026 20:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a Time to First Result (TTFR) estimator module that calculates transparent time ranges for reaching a first useful output from neuroscience datasets. The estimator considers access requirements, data preprocessing complexity, and first output generation time based on datasource characteristics and metadata analysis.

Changes:

  • Adds core TTFR estimation logic with configurable datasource defaults and intelligent inference from metadata
  • Provides comprehensive test coverage with 20 unit tests across modality inference, multimodal detection, documentation quality assessment, and estimation scenarios
  • Includes a demonstration script showcasing the estimator with example datasets

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

File Description
pyproject.toml Adds pytest as a project dependency for testing
backend/ttfr_estimator.py Core estimation module with enums, data classes, datasource configuration, and estimation logic
backend/test_ttfr_estimator.py Comprehensive test suite covering helper functions and end-to-end estimation scenarios
backend/demo_ttfr.py Demo script illustrating TTFR estimation for various dataset types

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +170 to +175
MODALITY_KEYWORDS = {
ModalityComplexity.LOW: ["simulated", "model", "morphology", "ion channel", "database"],
ModalityComplexity.MEDIUM: ["microscopy", "image", "gene expression", "single cell", "ephys", "electrophysiology"],
ModalityComplexity.HIGH: ["mri", "fmri", "pet", "meg", "eeg", "bold", "neuroimaging"],
ModalityComplexity.VERY_HIGH: ["multimodal", "multi-modal", "combined", "integrative"]
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MODALITY_KEYWORDS dictionary on line 170-175 includes overlapping keywords. For example, "multimodal" and "multi-modal" are in VERY_HIGH complexity keywords, but the infer_modality_from_keywords function checks from VERY_HIGH to LOW in order (line 243). This means if a dataset description contains "multimodal imaging", it would be classified as VERY_HIGH, but the function would also match it against keywords in other levels. The current implementation works correctly because of the ordered checking, but this could be documented more clearly.

Copilot uses AI. Check for mistakes.
Comment on lines +33 to +41
min_hours = int(self.min_days * 24)
max_hours = int(self.max_days * 24)
return f"{min_hours}{max_hours} hours"
elif self.min_days < 1:
min_hours = int(self.min_days * 24)
max_days = int(self.max_days)
return f"{min_hours} hours–{max_days} days"
else:
return f"{int(self.min_days)}{int(self.max_days)} days"
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The str method has several issues: (1) Using int() truncates instead of rounding, which can be misleading (e.g., 0.9 days becomes 0 days instead of 1 day). (2) There's no handling for singular vs plural forms, which will produce grammatically incorrect output like "1 days" when max_days rounds to 1. (3) The mixed units format on line 39 could be confusing. Consider using round() instead of int() and adding conditional logic for singular/plural forms.

Suggested change
min_hours = int(self.min_days * 24)
max_hours = int(self.max_days * 24)
return f"{min_hours}{max_hours} hours"
elif self.min_days < 1:
min_hours = int(self.min_days * 24)
max_days = int(self.max_days)
return f"{min_hours} hours–{max_days} days"
else:
return f"{int(self.min_days)}{int(self.max_days)} days"
min_hours = round(self.min_days * 24)
max_hours = round(self.max_days * 24)
if min_hours == max_hours:
unit = "hour" if min_hours == 1 else "hours"
return f"{min_hours} {unit}"
return f"{min_hours}{max_hours} hours"
elif self.min_days < 1:
min_hours = round(self.min_days * 24)
max_days = round(self.max_days)
hours_unit = "hour" if min_hours == 1 else "hours"
days_unit = "day" if max_days == 1 else "days"
return f"{min_hours} {hours_unit}{max_days} {days_unit}"
else:
min_days = round(self.min_days)
max_days = round(self.max_days)
if min_days == max_days:
unit = "day" if min_days == 1 else "days"
return f"{min_days} {unit}"
return f"{min_days}{max_days} days"

Copilot uses AI. Check for mistakes.
"langgraph>=0.6.4",
"matplotlib>=3.10.3",
"pandas>=2.3.1",
"pytest>=9.0.2",
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest should be a development dependency, not a production dependency. The pytest.main() call in test_ttfr_estimator.py is only used when running the test file directly, which is a development activity. Consider moving this to a dev-dependencies or optional-dependencies section in pyproject.toml instead.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,342 @@
from dataclasses import dataclass
from typing import Optional, Dict, Any, Tuple
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Tuple import from typing is unused and should be removed to keep imports clean.

Suggested change
from typing import Optional, Dict, Any, Tuple
from typing import Optional, Dict, Any

Copilot uses AI. Check for mistakes.
Comment on lines +64 to +167
DATASOURCE_CONFIG = {
"scr_005031_openneuro": {
"access": AccessType.OPEN,
"format": FormatType.BIDS,
"typical_modality": ModalityComplexity.HIGH,
"doc_quality": "high"
},
"scr_017571_dandi": {
"access": AccessType.OPEN,
"format": FormatType.NWB,
"typical_modality": ModalityComplexity.HIGH,
"doc_quality": "high"
},
"scr_007271_modeldb_models": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "medium"
},
"scr_017612_ebrains": {
"access": AccessType.LOGIN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.VERY_HIGH,
"doc_quality": "high"
},
"scr_003510_cil_images": {
"access": AccessType.OPEN,
"format": FormatType.STANDARD_IMAGE,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "medium"
},
"scr_002145_neuromorpho_modelimage": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.LOW,
"doc_quality": "high"
},
"scr_017041_sparc": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.HIGH,
"doc_quality": "high"
},
"scr_002978_aba_expression": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "high"
},
"scr_005069_brainminds": {
"access": AccessType.APPROVAL,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.VERY_HIGH,
"doc_quality": "medium"
},
"scr_002721_gensat_geneexpression": {
"access": AccessType.OPEN,
"format": FormatType.STANDARD_IMAGE,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "medium"
},
"scr_003105_neurondb_currents": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.LOW,
"doc_quality": "medium"
},
"scr_006131_hba_atlas": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "high"
},
"scr_014194_icg_ionchannels": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.LOW,
"doc_quality": "medium"
},
"scr_013705_neuroml_models": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "high"
},
"scr_014306_bbp_cellmorphology": {
"access": AccessType.LOGIN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "high"
},
"scr_016433_conp": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.HIGH,
"doc_quality": "medium"
},
"scr_006274_neuroelectro_ephys": {
"access": AccessType.OPEN,
"format": FormatType.CUSTOM,
"typical_modality": ModalityComplexity.MEDIUM,
"doc_quality": "high"
}
}
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DATASOURCE_CONFIG dictionary is comprehensive but could benefit from documentation. Consider adding a docstring or comment explaining the structure and purpose of this configuration, especially for maintainers who might need to add new datasources or modify existing ones.

Copilot uses AI. Check for mistakes.
Comment on lines +251 to +258
def detect_multimodal(text: str) -> bool:
if not text:
return False

text_lower = text.lower()
multimodal_keywords = ["multimodal", "multi-modal", "combined", "integrative", " and "]

return any(kw in text_lower for kw in multimodal_keywords)
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function has inconsistent handling of None input. Line 252 checks "if not text" which would return True for both None and empty string, but the function signature accepts str type without Optional. If None is a valid input (as suggested by the test on line 46), the type hint should be Optional[str]. Additionally, the test on line 46 passes None which would fail the type check.

Copilot uses AI. Check for mistakes.
Comment on lines +151 to +153

if __name__ == "__main__":
pytest.main([__file__, "-v"])
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using pytest.main() in the main block is unnecessary. Tests should be run using the pytest command directly from the command line. This pattern adds pytest as a runtime import requirement when running the file directly, which is unconventional. Consider removing this block entirely.

Suggested change
if __name__ == "__main__":
pytest.main([__file__, "-v"])

Copilot uses AI. Check for mistakes.
datasource_id: Optional[str] = None,
metadata: Optional[Dict[str, Any]] = None,
content: Optional[str] = None
) -> TTFREstimate:
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function lacks documentation. Adding a docstring would significantly improve maintainability by explaining the parameters, return value, and the estimation logic. This is especially important for a public API function that will be integrated into other parts of the system.

Suggested change
) -> TTFREstimate:
) -> TTFREstimate:
"""
Estimate the time to first result (TTFR) for working with a datasource.
Parameters
----------
datasource_id:
Optional identifier for a known datasource. When provided and present in
``DATASOURCE_CONFIG``, datasource-specific defaults (access type, format,
typical modality, and documentation quality) are used for the estimate.
metadata:
Optional dictionary of datasource metadata (for example, title, description,
authors, license, documentation links). Certain fields are used to assess
documentation quality and to refine modality inference.
content:
Optional free-form content or description associated with the datasource.
This text is used to infer task complexity (modality) and whether the
scenario is multimodal.
Returns
-------
TTFREstimate
A structured estimate of time to first result, including time ranges and
any assumptions made during estimation.
Notes
-----
If ``datasource_id`` is not provided or not found in ``DATASOURCE_CONFIG``,
the function falls back to general heuristics based on the supplied
``metadata`` and ``content``. Additional assumptions are recorded to explain
how the estimate was derived.
"""

Copilot uses AI. Check for mistakes.
print(f" - {assumption}")

print(f"\nJSON format:")
import json
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The json module is imported inside the loop on line 55, which means it will be imported multiple times unnecessarily. Move the import statement to the top of the file with other imports for better performance and following Python best practices.

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +10
from ttfr_estimator import (
estimate_ttfr,
AccessType,
ModalityComplexity,
FormatType,
infer_modality_from_keywords,
detect_multimodal,
assess_documentation_quality
)
Copy link

Copilot AI Jan 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'AccessType' is not used.
Import of 'FormatType' is not used.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant