Use augur subsample by victorlin · Pull Request #103 · nextstrain/rsv

victorlin · 2025-09-22T22:04:53Z

Description of proposed changes

This PR contains 2 prep commits + 1 main commit. Message from main commit:

The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration.

To keep the workflow config schema concise, we generate each augur subsample config dynamically using a patterns defined in the config helper functions in config.smk.

This is a breaking change and the old configuration will no longer work.

Related issue(s)

Closes #101

Checklist

Draft necessary changes on NW-PaGe fork (branch)
Checks pass
~~Update changelog~~

old implementations

YAML anchors and aliases: b0d6728
matrix: d24a8d1
dataset-first glob syntax 1: 5741f0f
option-first glob syntax 1: bff7e53
option-first glob syntax 2: 1df4e05
dataset-first glob syntax 2: 397850d

joverlee521 · 2025-10-06T21:00:33Z

+  genome/all-time:
+    samples:
+      all-time:
+        <<: [*subsample_genome, *subsample_all-time, *subsample_defaults]


In general, I'm not sure this pattern of YAML anchors/aliases is very user friendly. If we do use anchors/aliases, then we should probably disable them in the dumped config so that users can at least see the fully expanded config in results/run_config.yaml (see suggested workaround in yaml/pyyaml#103).

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu. These could be expanded at start of Snakemake (or whatever comes of discussion in nextstrain/public#23).

If we do use anchors/aliases, then we should probably disable them in the dumped config

Yes, agreed. (I'd go further: we should write out small-multiples with lots of duplication, but that's beyond this PR.)

I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu.

Across the 9 builds almost all of the difference is the subsampling parameters, so at some level it is going to be either complex or verbose. I don't think glob-like syntax will help here with the structure as it is now, but it could if you moved towards something more like:

subsample: samples: min_length: "genome/*": 10_000 "G/*": 600 "F/*": 1_200 group_by: "year country" min_coverage: 0.3 resolutions: "*/all-time": {"min_date": "1795-01-01"} "*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"} "*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"}

I also don't think we should use YAML anchors and aliases, and would like to decide on a good alternative in nextstrain/public#27. I'll consider something like the above with config pre-processing to translate into augur subsample-ready config.

For my ability to follow this thread, the reason against using YAML anchors and aliases:

not sure if the pattern of YAML anchors/aliases is very user friendly (for which users? A google search is returning several articles recommending YAML anchors (example) to avoid YAML duplication, but maybe these are not our user group or feel free to add comments from WA-DOH/others etc.)

requires resolving the anchors explicitly in the dumped config (similar lift as config pre-processing to translate into augur subsample-ready config)

Feel free to add to the above list, mostly looking for a summary statement

not sure if the pattern of YAML anchors/aliases is very user friendly

I don't think we're advising against anchors per-se -- they can be very useful! -- rather I think the pushback was against the specific usage of anchors in this config. (The pushback wasn't from me, so I won't elaborate.)

requires resolving the anchors explicitly in the dumped config

Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.

Agreed on options > datasets. However, I think treating samples as the most important concept makes defaults something that needs to be understood rather than implicit.

With "options > datasets > samples", configuration is applied in layers:

global defaults

dataset-level overrides

sample-level overrides

In this model, samples are an optional refinement. For some users, such as those using a single augur filter call, the default single-sample behavior is sufficient and they don't need to think about samples at all. When needed, datasets can then be split into multiple samples with more granular config values.

We can extend this idea to datasets as well. This works nicely with the avian-flu idea that options can take scalar values applied to all datasets.

Using a new example, let's look at the current filter config for zika. With the single-dataset default behavior, swapping to augur subsample would be as simple as swapping the key name from filter to subsample.

subsample: group_by: - country - year - month sequences_per_group: 40 min_date: 2012 min_length: 5385

When the need arises to start making multiple datasets (e.g. genome vs E gene), only dataset-specific values need to change. (unchanged values shown as comments)

subsample: # group_by: # - country # - year # - month # sequences_per_group: 40 # min_date: 2012 min_length: 'genome': 5385 'E': 1400

When the need arises to start making multiple samples within a dataset (e.g. time resolutions), sample-level overrides can be introduced.

subsample: # group_by: # - country # - year # - month max_sequences: '*/all-time': 1000 '*/1y': early: 200 recent: 800 min_date: '*/all-time': 2012 '*/1y': early: 2012 recent: 1Y max_date: '*/1y': early: 1Y # min_length: # 'genome/*': 5385 # 'E/*': 1400

In all cases, shared configuration remains unchanged unless explicitly overridden.

I've incorporated some of the ideas above into the latest changes.

[@jameshadfield] Making the association between datasets and samples clearer is a big one for me.

Added in a28275d.

[@jameshadfield] in avian-flu you can supply a scalar value directly if all wildcard combinations use the same value.

Done. Now the YAML diff in 852ebb7 is easier to compare with the previous filter config.

[@joverlee521] I'm a little unsure about the concatenation of query and list options.

Removed in 487e0f4.

[@joverlee521] I wonder if we should also support directly providing the subsample config

Done using a separate key, mutually exclusive with custom_subsample:

rsv/README.md

Lines 170 to 183 in a28275d

You can also provide raw Augur subsample configs for exact datasets with

`subsample_datasets`. Keys must be exact dataset names, not patterns. Values

must follow the

[Augur subsample config schema](https://nextstrain.org/schemas/augur/subsample-config/v1).

These raw configs override the generated config for matching datasets only,

while other datasets still use the default merged `subsample` config.

```yaml

subsample_datasets:

a/genome/3y:

samples:

sample:

max_sequences: 100

```

Preview of WA DOH config using subsample_datasets

Thanks for the latest iteration Victor! I'm pretty happy with where this has landed, but the goal post moves again...Now I'm questioning where would proximal subsampling parameters fit into the wildcard config? These are very sample focused so I don't see them as top level options next to group_by, min_length, etc. Would these nest under samples?

subsample: samples: '*/*/(6y|3y)': - recent - background: focal_sample: recent k: 5 max_distance: 5

Indeed, all the prototypes so far have not considered the more sophisticated config schema that is now supported by augur subsample with support for proximal samples and samples dependent on other samples.

subsample_datasets is an easy way out but we shouldn't rely on that. For the default config, the option-first structure clashes with the sample-first nature of the augur subsample config (now I see why James treats samples as the most important concept).

I don't think we should support both subsample.<option>... and subsample.samples...<option>. That would introduce yet another layer of merging to complicate things.

The option-first structure (option > datasets > samples) works best for filter-sample-only config and worst for config with proximal samples, while a sample-first structure (as James proposed) is the opposite. I wonder if reverting to a dataset-first structure would be a good middle ground.

(Apologies for the delayed reply - let's discuss this at a dev-chat meeting soon based on the gdoc you're creating! I'm glad the concept of tiered/hierarchical subsampling has entered this discussion, our config approach should support it.)

victorlin · 2025-10-07T02:18:29Z

I'll wait for a decision in nextstrain/public#27 before continuing here.

Preparing to use build_dir in config.smk, and I figured it'd be good to move both auspice_dir with it.

Similar to "Add separate frequencies config" (0b22185), this rule shouldn't rely on config from another rule.

To be used for validating augur subsample config.

This makes it easier to read the diff for the following commit.

The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration. To keep the workflow config schema concise, we generate each augur subsample config dynamically using a patterns defined in the config helper functions in config.smk. This is a breaking change and the old configuration will no longer work.

datasets.<dataset>.<rule>: <rule config schema> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

victorlin self-assigned this Sep 22, 2025

victorlin force-pushed the victorlin/use-augur-subsample branch from e869e31 to 18232af Compare September 22, 2025 22:24

victorlin mentioned this pull request Sep 22, 2025

Config improvements #102

Merged

2 tasks

victorlin linked an issue Sep 22, 2025 that may be closed by this pull request

Use augur subsample #101

Open

victorlin mentioned this pull request Sep 23, 2025

augur subsample: merging of external & default configs #106

Closed

victorlin force-pushed the victorlin/use-augur-subsample branch from 18232af to b0d6728 Compare September 24, 2025 20:20

joverlee521 reviewed Oct 6, 2025

View reviewed changes

victorlin mentioned this pull request Oct 7, 2025

Use augur subsample nextstrain/mumps#45

Draft

3 tasks

victorlin marked this pull request as draft October 7, 2025 02:18

victorlin mentioned this pull request Oct 14, 2025

Disable aliases in dumped config nextstrain/shared#62

Merged

3 tasks

victorlin force-pushed the victorlin/use-augur-subsample branch from b0d6728 to 5ee2efa Compare February 21, 2026 03:04

victorlin force-pushed the victorlin/update-filter-config branch from 3340439 to 7dd63f2 Compare February 21, 2026 03:04

victorlin force-pushed the victorlin/use-augur-subsample branch from 5ee2efa to 5490644 Compare February 26, 2026 01:04

victorlin force-pushed the victorlin/use-augur-subsample branch from 5490644 to 42aa5f6 Compare March 6, 2026 01:48

victorlin force-pushed the victorlin/update-filter-config branch from e243d2f to 0b22185 Compare March 6, 2026 01:48

Base automatically changed from victorlin/update-filter-config to master March 6, 2026 18:55

victorlin force-pushed the victorlin/use-augur-subsample branch from 42aa5f6 to b437074 Compare March 7, 2026 02:28

victorlin force-pushed the victorlin/use-augur-subsample branch from a727f0c to 9b071fb Compare March 24, 2026 00:01

victorlin force-pushed the victorlin/use-augur-subsample branch from 9b071fb to 627a61b Compare April 8, 2026 00:28

victorlin and others added 2 commits April 8, 2026 13:59

Move constants up in file

b7cb625

Preparing to use build_dir in config.smk, and I figured it'd be good to move both auspice_dir with it.

Add separate config section for filter_for_pre_subsample_alignment

e052166

Similar to "Add separate frequencies config" (0b22185), this rule shouldn't rely on config from another rule.

victorlin force-pushed the victorlin/use-augur-subsample branch 2 times, most recently from e54f105 to e48b466 Compare April 9, 2026 18:20

Define valid dataset parts

6b44998

To be used for validating augur subsample config.

victorlin force-pushed the victorlin/use-augur-subsample branch from e48b466 to 5741f0f Compare April 9, 2026 18:37

joverlee521 mentioned this pull request Apr 10, 2026

phylo: Support different build parameters for augur refine nextstrain/measles#118

Closed

Reorder options

6e7a3f3

This makes it easier to read the diff for the following commit.

victorlin force-pushed the victorlin/use-augur-subsample branch from 08eb257 to a28275d Compare May 6, 2026 23:45

victorlin added 3 commits May 7, 2026 10:44

🚧 glob syntax for refine

6536cb9

🚧 no custom merging

1e15f55

🚧 require explicit dataset to sample mapping

3972508

victorlin force-pushed the victorlin/use-augur-subsample branch from a28275d to 3972508 Compare May 7, 2026 17:48

victorlin mentioned this pull request May 7, 2026

Config based Augur commands nextstrain/augur#1987

Open

7 tasks

🚧 rule-agnostic multi-dataset YAML config

3ca4831

datasets.<dataset>.<rule>: <rule config schema> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use augur subsample#103

Use augur subsample#103
victorlin wants to merge 9 commits into
masterfrom
victorlin/use-augur-subsample

victorlin commented Sep 22, 2025 •

edited

Loading

Uh oh!

joverlee521 Oct 6, 2025

Uh oh!

jameshadfield Oct 6, 2025 •

edited

Loading

Uh oh!

victorlin Oct 7, 2025

Uh oh!

j23414 Oct 7, 2025

Uh oh!

jameshadfield Oct 8, 2025

Uh oh!

victorlin May 5, 2026 •

edited

Loading

Uh oh!

victorlin May 6, 2026 •

edited

Loading

Uh oh!

joverlee521 May 11, 2026

Uh oh!

victorlin May 12, 2026

Uh oh!

jameshadfield May 13, 2026

Uh oh!

victorlin commented Oct 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

	You can also provide raw Augur subsample configs for exact datasets with
	`subsample_datasets`. Keys must be exact dataset names, not patterns. Values
	must follow the
	[Augur subsample config schema](https://nextstrain.org/schemas/augur/subsample-config/v1).
	These raw configs override the generated config for matching datasets only,
	while other datasets still use the default merged `subsample` config.

	```yaml
	subsample_datasets:
	a/genome/3y:
	samples:
	sample:
	max_sequences: 100
	```

Conversation

victorlin commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of proposed changes

Related issue(s)

Checklist

Uh oh!

joverlee521 Oct 6, 2025

Choose a reason for hiding this comment

Uh oh!

jameshadfield Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

victorlin Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

j23414 Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

jameshadfield Oct 8, 2025

Choose a reason for hiding this comment

Uh oh!

victorlin May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

victorlin May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joverlee521 May 11, 2026

Choose a reason for hiding this comment

Uh oh!

victorlin May 12, 2026

Choose a reason for hiding this comment

Uh oh!

jameshadfield May 13, 2026

Choose a reason for hiding this comment

Uh oh!

victorlin commented Oct 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

victorlin commented Sep 22, 2025 •

edited

Loading

jameshadfield Oct 6, 2025 •

edited

Loading

victorlin May 5, 2026 •

edited

Loading

victorlin May 6, 2026 •

edited

Loading