tests: qualitative vs LLM usage #322

@planetf1

Description

mellea uses a qualitative marker on its pytest tests.

The marker's intent matches its name: it flags tests of output quality rather than infrastructure.

However, a separate issue is that even the non-qualitative tests require LLMs/API keys, which a developer working on mellea may not have.

I think it would be useful to add further markers so that developers can easily run whichever subset of tests their resources and API keys allow.

For example, the tests currently vary along several 'dimensions':

  Dimension     Options
  Backend       Ollama, OpenAI, Watsonx, HuggingFace, vLLM, LiteLLM
  Auth          None, API key required
  Resources     Light, Heavy (32GB+ RAM, GPU)
  Determinism   Deterministic infra test, Non-deterministic quality test

We could add more markers - for example, per-backend markers:

  @pytest.mark.ollama
  @pytest.mark.openai      # implies API key
  @pytest.mark.watsonx     # implies API key
  @pytest.mark.huggingface # implies heavy resources
  @pytest.mark.vllm        # implies GPU
  @pytest.mark.qualitative # orthogonal - output quality check

Commands:

  pytest -m "not (ollama or openai or watsonx or huggingface or vllm)"  # Pure unit tests
  pytest -m "ollama and not qualitative"   # Ollama infra only
  pytest -m "not (openai or watsonx or huggingface or vllm)"  # Works with just Ollama
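Whichever naming is chosen, the new markers would need registering so pytest doesn't warn on unknown marks (and doesn't error under --strict-markers). A sketch, assuming configuration lives in pyproject.toml and with illustrative descriptions:

```toml
# Hypothetical registration in pyproject.toml; descriptions are illustrative.
[tool.pytest.ini_options]
markers = [
    "ollama: needs a local Ollama server",
    "openai: needs an OpenAI API key",
    "watsonx: needs a Watsonx API key",
    "huggingface: needs heavy local resources (32GB+ RAM)",
    "vllm: needs a GPU",
    "qualitative: non-deterministic output quality check",
]
```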

Or, alternatively, go for capability markers:

  @pytest.mark.llm         # Any LLM call (needs at least Ollama)
  @pytest.mark.api_key     # Needs external API key
  @pytest.mark.heavy       # Needs 32GB+ RAM or GPU
  @pytest.mark.qualitative # Non-deterministic output check

Commands:

  pytest -m "not llm"                      # Fast unit tests (~seconds)
  pytest -m "llm and not (api_key or heavy or qualitative)"  # Ollama-only
  pytest -m "not (api_key or heavy)"       # Everything without special setup
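With capability markers, a single test can stack several. A hypothetical example (the test name and marker combination are purely illustrative):

```python
import pytest

# Hypothetical test stacking the proposed capability markers: it calls an
# LLM, needs an external API key, and checks output quality.
@pytest.mark.llm
@pytest.mark.api_key
@pytest.mark.qualitative
def test_summary_mentions_topic():
    ...
```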

The challenge with either scheme is that it puts more cognitive load on the developer to figure out which marker expression to run.

So what we could do is have conftest.py detect:

  • environment variables used for API keys
  • system RAM availability/platform

Developers could then just run pytest and get a sensible default for their environment, with any unavailable tests skipped with an informative message.
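The detection could look roughly like this. A minimal sketch only: the env var names (e.g. WATSONX_API_KEY), the 32GB threshold, and the RAM probe are assumptions to adjust to what mellea actually uses.

```python
# Sketch of auto-skip logic for conftest.py. Env var names and the 32GB
# threshold are assumptions, not mellea's actual configuration.
import os
import pytest

# Map backend marker name -> env vars that must be set for it to run.
API_KEY_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY"],
}

def available_ram_gb():
    """Best-effort total system RAM in GiB; 0.0 if the platform can't say."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, OSError, ValueError):
        return 0.0

def missing_keys(backend):
    """Env vars required by a backend that are not currently set."""
    return [v for v in API_KEY_VARS.get(backend, []) if not os.environ.get(v)]

def pytest_collection_modifyitems(config, items):
    # Skip (rather than fail) tests whose requirements aren't met locally.
    for item in items:
        for backend in API_KEY_VARS:
            missing = missing_keys(backend)
            if backend in item.keywords and missing:
                item.add_marker(pytest.mark.skip(
                    reason=f"requires {', '.join(missing)}"))
        if "heavy" in item.keywords and available_ram_gb() < 32:
            item.add_marker(pytest.mark.skip(reason="needs 32GB+ RAM"))
```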

Happy to create a PR if there is some agreement on this. I think I've figured out an approach...
