tests: qualitative vs LLM usage #322

@planetf1

Description

mellea uses a qualitative marker on its pytest tests.

The marker's intent matches its name: it flags tests of output quality rather than infrastructure.

However, a separate issue is that even the non-qualitative tests require LLMs/API keys, which a developer working on mellea may not have.

I think it would be useful to add further markers so that developers can easily run whichever subset of tests their resources and API keys allow.

For example, the tests currently vary along several 'dimensions':

  Dimension     Options
  Backend       Ollama, OpenAI, Watsonx, HuggingFace, vLLM, LiteLLM
  Auth          None, API key required
  Resources     Light, Heavy (32GB+ RAM, GPU)
  Determinism   Deterministic infra test, Non-deterministic quality test

We could add more markers - for example, per-backend markers:

  @pytest.mark.ollama
  @pytest.mark.openai      # implies API key
  @pytest.mark.watsonx     # implies API key
  @pytest.mark.huggingface # implies heavy resources
  @pytest.mark.vllm        # implies GPU
  @pytest.mark.qualitative # orthogonal - output quality check

Commands:

  pytest -m "not (ollama or openai or watsonx or huggingface or vllm)"  # Pure unit tests
  pytest -m "ollama and not qualitative"   # Ollama infra only
  pytest -m "not (openai or watsonx or huggingface or vllm)"  # Works with just Ollama
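Whichever naming is chosen, the new markers would need registering so pytest doesn't warn on unknown marks (and doesn't error under --strict-markers). A sketch, assuming configuration lives in pyproject.toml and with illustrative descriptions:

```toml
# Hypothetical registration in pyproject.toml; descriptions are illustrative.
[tool.pytest.ini_options]
markers = [
    "ollama: needs a local Ollama server",
    "openai: needs an OpenAI API key",
    "watsonx: needs a Watsonx API key",
    "huggingface: needs heavy local resources (32GB+ RAM)",
    "vllm: needs a GPU",
    "qualitative: non-deterministic output quality check",
]
```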

Or, alternatively, go for capability markers:

  @pytest.mark.llm         # Any LLM call (needs at least Ollama)
  @pytest.mark.api_key     # Needs external API key
  @pytest.mark.heavy       # Needs 32GB+ RAM or GPU
  @pytest.mark.qualitative # Non-deterministic output check

Commands:

  pytest -m "not llm"                      # Fast unit tests (~seconds)
  pytest -m "llm and not (api_key or heavy or qualitative)"  # Ollama-only
  pytest -m "not (api_key or heavy)"       # Everything without special setup
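With capability markers, a single test can stack several. A hypothetical example (the test name and marker combination are purely illustrative):

```python
import pytest

# Hypothetical test stacking the proposed capability markers: it calls an
# LLM, needs an external API key, and checks output quality.
@pytest.mark.llm
@pytest.mark.api_key
@pytest.mark.qualitative
def test_summary_mentions_topic():
    ...
```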

The challenge with either scheme is that it puts more cognitive load on the developer to figure out which marker expression to run.

So what we could do is have conftest.py detect:

  • environment variables used for API keys
  • system RAM availability/platform

Developers could then just run pytest and get a sensible default for their environment, with any unavailable tests skipped with an informative message.
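The detection could look roughly like this. A minimal sketch only: the env var names (e.g. WATSONX_API_KEY), the 32GB threshold, and the RAM probe are assumptions to adjust to what mellea actually uses.

```python
# Sketch of auto-skip logic for conftest.py. Env var names and the 32GB
# threshold are assumptions, not mellea's actual configuration.
import os
import pytest

# Map backend marker name -> env vars that must be set for it to run.
API_KEY_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY"],
}

def available_ram_gb():
    """Best-effort total system RAM in GiB; 0.0 if the platform can't say."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    except (AttributeError, OSError, ValueError):
        return 0.0

def missing_keys(backend):
    """Env vars required by a backend that are not currently set."""
    return [v for v in API_KEY_VARS.get(backend, []) if not os.environ.get(v)]

def pytest_collection_modifyitems(config, items):
    # Skip (rather than fail) tests whose requirements aren't met locally.
    for item in items:
        for backend in API_KEY_VARS:
            missing = missing_keys(backend)
            if backend in item.keywords and missing:
                item.add_marker(pytest.mark.skip(
                    reason=f"requires {', '.join(missing)}"))
        if "heavy" in item.keywords and available_ram_gb() < 32:
            item.add_marker(pytest.mark.skip(reason="needs 32GB+ RAM"))
```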

Happy to create a PR if there is some agreement on this. I think I've figured out an approach...
