Description
mellea uses a `qualitative` marker on its pytest tests.
The current intent is well matched to the name: it marks tests that check output quality rather than infrastructure.
However, another issue is that even the non-qualitative tests require LLMs/API keys, which a developer working on mellea may not have.
I think it would be useful to add additional markers to make it easier to run the tests given whatever limitations a developer has on resource usage or availability of keys.
For example, the tests currently vary along several dimensions:
| Dimension | Options |
|---|---|
| Backend | Ollama, OpenAI, Watsonx, HuggingFace, vLLM, LiteLLM |
| Auth | None, API key required |
| Resources | Light, Heavy (32GB+ RAM, GPU) |
| Determinism | Deterministic infra test, Non-deterministic quality test |
We could create more markers, for example one per backend:

```python
@pytest.mark.ollama
@pytest.mark.openai       # implies API key
@pytest.mark.watsonx      # implies API key
@pytest.mark.huggingface  # implies heavy resources
@pytest.mark.vllm         # implies GPU
@pytest.mark.qualitative  # orthogonal - output quality check
```

Commands:

```bash
pytest -m "not (ollama or openai or watsonx or huggingface or vllm)"  # Pure unit tests
pytest -m "ollama and not qualitative"                                # Ollama infra only
pytest -m "not (openai or watsonx or huggingface or vllm)"            # Works with just Ollama
```
Or alternatively, go for capability markers:

```python
@pytest.mark.llm          # Any LLM call (needs at least Ollama)
@pytest.mark.api_key      # Needs external API key
@pytest.mark.heavy        # Needs 32GB+ RAM or GPU
@pytest.mark.qualitative  # Non-deterministic output check
```

Commands:

```bash
pytest -m "not llm"                                        # Fast unit tests (~seconds)
pytest -m "llm and not (api_key or heavy or qualitative)"  # Ollama-only
pytest -m "not (api_key or heavy)"                         # Everything without special setup
```

The challenge with this is that it puts more cognitive load on the developer to figure out which marker expression to run.
So what we could do is have conftest.py detect:
- environment variables used for API keys
- system RAM availability/platform
Then developers would just need to run pytest to get a sensible default for their environment, and we could ensure any unavailable tests are skipped with an appropriate message.
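As a rough sketch of what that detection could look like (the environment variable names, the RAM threshold, and the use of `psutil` are assumptions for illustration, not what mellea actually uses):

```python
# conftest.py (sketch) - skip tests whose requirements are not met locally
import os

import psutil  # assumed dependency, used only for RAM detection
import pytest

HEAVY_RAM_BYTES = 32 * 1024**3  # illustrative threshold for "heavy" tests


def _missing_requirements():
    """Map marker name -> reason it cannot run on this machine."""
    missing = {}
    if not os.environ.get("OPENAI_API_KEY"):  # illustrative env var name
        missing["openai"] = "OPENAI_API_KEY is not set"
    if not os.environ.get("WATSONX_API_KEY"):  # illustrative env var name
        missing["watsonx"] = "WATSONX_API_KEY is not set"
    if psutil.virtual_memory().total < HEAVY_RAM_BYTES:
        missing["heavy"] = "needs 32GB+ RAM"
    return missing


MISSING = _missing_requirements()


def pytest_runtest_setup(item):
    for marker, reason in MISSING.items():
        if item.get_closest_marker(marker):
            pytest.skip(f"skipped on this machine: {reason}")
```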
Happy to create a PR if there is some agreement on this. I think I've figured out an approach...