
Benchmark Regression Test Suite #647

@AdamGleave

Description


Problem

imitation's testing is currently limited to static analysis (type checking, linting, etc.) and unit tests. There are no automated, end-to-end tests of algorithm training performance. This is problematic because small implementation details in reward and imitation learning can have large impacts on performance.

Solution

We do, however, already have tuned hyperparameters and some initial results in https://github.com/HumanCompatibleAI/imitation/tree/master/benchmarking thanks to @taufeeque9 (a PR with the code used for hyperparameter tuning should be forthcoming soon as well). This suggests we could simply have a test suite that trains all the algorithms using these existing configs and records their performance. If performance drops by more than some threshold, a warning or error could be issued.
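The threshold check described above could be sketched as follows. This is a minimal illustration, not imitation's actual API: the baseline numbers, the `BASELINES` table, and the `check_regression` helper are all hypothetical, and in practice the baselines would come from the tuned configs and results in `benchmarking/`.

```python
# Hypothetical baseline mean returns per (algorithm, environment).
# Real values would be recorded from runs with the tuned configs.
BASELINES = {
    ("bc", "seals/CartPole-v0"): 500.0,
    ("gail", "seals/HalfCheetah-v0"): 1500.0,
}

# Tolerated relative drop before we flag a regression.
THRESHOLD = 0.1


def check_regression(algo: str, env: str, measured_return: float) -> bool:
    """Return True if measured_return regressed past the threshold."""
    baseline = BASELINES[(algo, env)]
    return measured_return < baseline * (1 - THRESHOLD)
```

A CI job could then train each algorithm, call `check_regression` on the measured return, and emit a warning or fail the build on a `True` result.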

Training end-to-end is far too slow to do on every commit, but it's something we could afford to do before each (significant) release or before merging PRs that we're worried might cause regressions.

There are already some tools designed to track metrics over time, such as airspeed velocity (asv). Integrating with one of these might make sense, but I don't yet know how well their features line up with our needs.
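For concreteness, asv tracks the return value of any `track_*` method in a benchmark class over commits, so training performance could plausibly be wrapped like this. The `run_training_and_get_return` helper is a hypothetical stand-in for invoking imitation's training scripts; only the `track_*` naming convention and `timeout` attribute are real asv conventions.

```python
def run_training_and_get_return(algo: str, env: str) -> float:
    # Placeholder: would train with the tuned config from
    # benchmarking/ and return the learned policy's mean return.
    return 0.0


class TrackImitationReturn:
    # asv would re-run this per commit and plot the series over time.
    timeout = 3600  # end-to-end training is slow

    def track_bc_cartpole(self):
        return run_training_and_get_return("bc", "seals/CartPole-v0")
```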

Possible alternative solutions

We could also just add a handful of end-to-end training tests to pytest, marked as "expensive", skipped by default, and runnable on demand, which assert that reward stays above some threshold. This might give us a non-trivial fraction of the benefit with much less work.
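A sketch of this alternative, assuming an environment-variable gate for the skip-by-default behavior (a registered custom marker selected with `-m` would work equally well). The `train_expert_return` helper and the specific threshold are hypothetical placeholders for a real end-to-end training run.

```python
import os

import pytest

# Skipped unless explicitly requested, e.g. before a release:
#   RUN_EXPENSIVE_TESTS=1 pytest tests/test_regression.py
RUN_EXPENSIVE = os.environ.get("RUN_EXPENSIVE_TESTS") == "1"


def train_expert_return() -> float:
    # Placeholder: would run end-to-end training with a tuned
    # config and return the learned policy's mean episode return.
    return 490.0


@pytest.mark.skipif(not RUN_EXPENSIVE, reason="expensive end-to-end test")
def test_bc_cartpole_reward():
    mean_return = train_expert_return()
    assert mean_return >= 450.0, "performance regressed past threshold"
```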
