For a realistic evaluation, we need to check what happens when the emulators are applied iteratively to their own output, and how errors accumulate over successive steps.
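As an illustration of this kind of check, the following is a minimal sketch of an iterative rollout. It assumes the emulator and the reference model can each be written as a function mapping one state to the next. The names `rollout_errors`, `make_rotation`, `emulator`, and `reference_step` are hypothetical, and the toy 2D rotation merely stands in for whatever system is actually being emulated.

```python
import numpy as np


def rollout_errors(emulator, reference_step, x0, n_steps):
    """Feed the emulator its own output and record the RMSE against
    a reference trajectory started from the same initial state."""
    x_emu = np.asarray(x0, dtype=float)
    x_ref = np.asarray(x0, dtype=float)
    errors = []
    for _ in range(n_steps):
        x_emu = emulator(x_emu)        # emulator sees only its own previous output
        x_ref = reference_step(x_ref)  # ground-truth trajectory
        errors.append(np.sqrt(np.mean((x_emu - x_ref) ** 2)))
    return np.array(errors)


def make_rotation(theta):
    """Toy dynamics (assumption for illustration): rotate a 2D state by theta."""
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return lambda x: rotation @ x


# Hypothetical setup: the "emulator" has a small per-step angle error.
reference_step = make_rotation(0.10)
emulator = make_rotation(0.11)

errors = rollout_errors(emulator, reference_step, x0=[1.0, 0.0], n_steps=50)
print("one-step error:", errors[0])
print("error after 50 steps:", errors[-1])
```

In this toy case the one-step error is small, but the rollout error grows with every step because each prediction starts from an already slightly wrong state. Plotting `errors` against the step index is one simple way to compare how quickly different emulators drift.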