Use augur subsample #103
genome/all-time:
  samples:
    all-time:
      <<: [*subsample_genome, *subsample_all-time, *subsample_defaults]
In general, I'm not sure this pattern of YAML anchors/aliases is very user friendly. If we do use anchors/aliases, then we should probably disable them in the dumped config so that users can at least see the fully expanded config in results/run_config.yaml (see suggested workaround in yaml/pyyaml#103).
I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu. These could be expanded at start of Snakemake (or whatever comes of discussion in nextstrain/public#23).
> If we do use anchors/aliases, then we should probably disable them in the dumped config
Yes, agreed. (I'd go further: we should write out small-multiples with lots of duplication, but that's beyond this PR.)
> I wonder if this can use the config wildcards pattern that @jameshadfield used in avian-flu.
Across the 9 builds almost all of the difference is the subsampling parameters, so at some level it is going to be either complex or verbose. I don't think glob-like syntax will help here with the structure as it is now, but it could if you moved towards something more like:
subsample:
samples:
min_length:
"genome/*": 10_000
"G/*": 600
"F/*": 1_200
group_by: "year country"
min_coverage: 0.3
resolutions:
"*/all-time": {"min_date": "1795-01-01"}
"*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"}
"*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"}
I also don't think we should use YAML anchors and aliases, and would like to decide on a good alternative in nextstrain/public#27. I'll consider something like the above with config pre-processing to translate into augur subsample-ready config.
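As a rough illustration of the pre-processing idea, the sketch below resolves glob-keyed options for one build using Python's standard `fnmatch` module. The function name `resolve_option` and the `<gene>/<resolution>` build naming are assumptions made for this example; neither comes from the workflow or from augur.

```python
from fnmatch import fnmatchcase


def resolve_option(option, build):
    """Return the value of one option for `build` (e.g. "genome/6y").

    Hypothetical helper: a non-dict value applies to every build; a dict is
    treated as {glob pattern: value} and the first matching pattern wins.
    """
    if not isinstance(option, dict):
        return option
    for pattern, value in option.items():
        # fnmatchcase is case-sensitive, so "G/*" does not match "genome/6y".
        if fnmatchcase(build, pattern):
            return value
    return None  # no pattern matched this build


# Values copied from the proposed config above.
min_length = {"genome/*": 10_000, "G/*": 600, "F/*": 1_200}
group_by = "year country"
resolutions = {
    "*/all-time": {"min_date": "1795-01-01"},
    "*/6y": {"min_date": "6Y", "background_min_date": "1975-01-01"},
    "*/3y": {"min_date": "3Y", "background_min_date": "1975-01-01"},
}

for build in ("genome/6y", "G/3y", "F/all-time"):
    print(
        build,
        resolve_option(min_length, build),
        resolve_option(group_by, build),
        resolve_option(resolutions, build),
    )
```

First-match-wins is only one possible rule; a real implementation might instead reject a build that matches more than one pattern.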
To help me follow this thread, here are the reasons against using YAML anchors and aliases as I understand them:
- not sure if the pattern of YAML anchors/aliases is very user friendly (for which users? A Google search returns several articles recommending YAML anchors (example) to avoid YAML duplication, but maybe their readers are not our user group; feel free to add comments from WA-DOH or others.)
- requires resolving the anchors explicitly in the dumped config (similar lift to config pre-processing to translate into augur subsample-ready config)
Feel free to add to the above list; I'm mostly looking for a summary statement.
> not sure if the pattern of YAML anchors/aliases is very user friendly
I don't think we're advising against anchors per se (they can be very useful!); rather, I think the pushback was against the specific usage of anchors in this config. (The pushback wasn't from me, so I won't elaborate.)
> requires resolving the anchors explicitly in the dumped config
Anchors are resolved automatically, i.e. yaml.dump(yaml.load(...)) will not preserve anchors.
Agreed on options > datasets. However, I think treating samples as the most important concept makes defaults something that needs to be understood rather than implicit.
With "options > datasets > samples", configuration is applied in layers:
- global defaults
- dataset-level overrides
- sample-level overrides
In this model, samples are an optional refinement. For some users, such as those using a single augur filter call, the default single-sample behavior is sufficient and they don't need to think about samples at all. When needed, datasets can then be split into multiple samples with more granular config values.
We can extend this idea to datasets as well. This works nicely with the avian-flu idea that options can take scalar values applied to all datasets.
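The layering described above can be sketched as successive dictionary updates. This is only an illustration: `layered_config` is a hypothetical helper, not part of the workflow, and the values are borrowed from the zika filter config discussed next.

```python
def layered_config(defaults, dataset_overrides=None, sample_overrides=None):
    """Apply config in layers: global defaults, then dataset, then sample.

    Hypothetical helper. Later layers win, and omitted layers change nothing.
    """
    merged = dict(defaults)  # copy so the shared defaults are never mutated
    merged.update(dataset_overrides or {})
    merged.update(sample_overrides or {})
    return merged


defaults = {
    "group_by": ["country", "year", "month"],
    "sequences_per_group": 40,
    "min_date": 2012,
    "min_length": 5385,
}

# Single dataset, single sample: the defaults are used as-is.
print(layered_config(defaults))

# A dataset-level override (e.g. an E gene dataset) changes only min_length.
print(layered_config(defaults, {"min_length": 1400}))

# A sample-level override refines one sample within that dataset.
print(layered_config(defaults, {"min_length": 1400}, {"min_date": "1Y"}))
```

Users who never define datasets or samples only ever touch the first layer, which is the point of keeping samples as an optional refinement.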
Using a new example, let's look at the current filter config for zika. With the single-dataset default behavior, swapping to augur subsample would be as simple as swapping the key name from filter to subsample.
subsample:
  group_by:
    - country
    - year
    - month
  sequences_per_group: 40
  min_date: 2012
  min_length: 5385
When the need arises to start making multiple datasets (e.g. genome vs E gene), only dataset-specific values need to change. (unchanged values shown as comments)
subsample:
  # group_by:
  #   - country
  #   - year
  #   - month
  # sequences_per_group: 40
  # min_date: 2012
  min_length:
    'genome': 5385
    'E': 1400
When the need arises to start making multiple samples within a dataset (e.g. time resolutions), sample-level overrides can be introduced.
subsample:
  # group_by:
  #   - country
  #   - year
  #   - month
  max_sequences:
    '*/all-time': 1000
    '*/1y':
      early: 200
      recent: 800
  min_date:
    '*/all-time': 2012
    '*/1y':
      early: 2012
      recent: 1Y
  max_date:
    '*/1y':
      early: 1Y
  # min_length:
  #   'genome/*': 5385
  #   'E/*': 1400
In all cases, shared configuration remains unchanged unless explicitly overridden.
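To make the "options > datasets > samples" resolution concrete, here is a minimal sketch that expands the last config above into per-sample options for one dataset. Everything here is an assumption for illustration: `expand_samples`, the `"default"` sample name, and the returned `{sample: {option: value}}` shape are hypothetical and are not the augur subsample schema.

```python
from fnmatch import fnmatchcase


def expand_samples(subsample, dataset, default_sample="default"):
    """Expand an option-first config into {sample: {option: value}}.

    Hypothetical helper. Top-level dicts are read as {dataset glob: value};
    a dict found under a matching glob is read as {sample name: value}.
    """
    per_dataset = {}
    for option, spec in subsample.items():
        if isinstance(spec, dict):
            matches = [v for pat, v in spec.items() if fnmatchcase(dataset, pat)]
            if not matches:
                continue  # option is not set for this dataset
            spec = matches[0]  # first matching pattern wins
        per_dataset[option] = spec

    # Sample names are whatever the per-sample dicts mention, in first-seen order.
    names = []
    for value in per_dataset.values():
        if isinstance(value, dict):
            for name in value:
                if name not in names:
                    names.append(name)
    names = names or [default_sample]

    samples = {name: {} for name in names}
    for option, value in per_dataset.items():
        for name in names:
            if not isinstance(value, dict):
                samples[name][option] = value  # shared by all samples
            elif name in value:
                samples[name][option] = value[name]  # sample-level override
    return samples


config = {
    "group_by": ["country", "year", "month"],
    "max_sequences": {"*/all-time": 1000, "*/1y": {"early": 200, "recent": 800}},
    "min_date": {"*/all-time": 2012, "*/1y": {"early": 2012, "recent": "1Y"}},
    "max_date": {"*/1y": {"early": "1Y"}},
    "min_length": {"genome/*": 5385, "E/*": 1400},
}

print(expand_samples(config, "genome/1y"))
print(expand_samples(config, "E/all-time"))
```

One weakness this sketch exposes is that a nested dict is always read as per-sample values, so an option whose own value is a mapping would need a different encoding.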
I've incorporated some of the ideas above into the latest changes.
[@jameshadfield] Making the association between datasets and samples clearer is a big one for me.
Added in a28275d.
[@jameshadfield] in avian-flu you can supply a scalar value directly if all wildcard combinations use the same value.
Done. Now the YAML diff in 852ebb7 is easier to compare with the previous filter config.
[@joverlee521] I'm a little unsure about the concatenation of query and list options.
Removed in 487e0f4.
[@joverlee521] I wonder if we should also support directly providing the subsample config
Done using a separate key, mutually exclusive with custom_subsample:
Lines 170 to 183 in a28275d
Thanks for the latest iteration, Victor! I'm pretty happy with where this has landed, but the goalposts move again... Now I'm wondering where proximal subsampling parameters would fit into the wildcard config. These are very sample-focused, so I don't see them as top-level options next to group_by, min_length, etc. Would these nest under samples?
subsample:
  samples:
    '*/*/(6y|3y)':
      - recent
      - background:
          focal_sample: recent
          k: 5
          max_distance: 5
Indeed, all the prototypes so far have not considered the more sophisticated config schema that is now supported by augur subsample with support for proximal samples and samples dependent on other samples.
subsample_datasets is an easy way out but we shouldn't rely on that. For the default config, the option-first structure clashes with the sample-first nature of the augur subsample config (now I see why James treats samples as the most important concept).
I don't think we should support both subsample.<option>... and subsample.samples...<option>. That would introduce yet another layer of merging to complicate things.
The option-first structure (option > datasets > samples) works best for filter-sample-only config and worst for config with proximal samples, while a sample-first structure (as James proposed) is the opposite. I wonder if reverting to a dataset-first structure would be a good middle ground.
(Apologies for the delayed reply - let's discuss this at a dev-chat meeting soon based on the gdoc you're creating! I'm glad the concept of tiered/hierarchical subsampling has entered this discussion, our config approach should support it.)
I'll wait for a decision in nextstrain/public#27 before continuing here.
Preparing to use build_dir in config.smk, and I figured it'd be good to move auspice_dir along with it.
Similar to "Add separate frequencies config" (0b22185), this rule shouldn't rely on config from another rule.
To be used for validating augur subsample config.
This makes it easier to read the diff for the following commit.
The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration. To keep the workflow config schema concise, we generate each augur subsample config dynamically using patterns defined in the config helper functions in config.smk. This is a breaking change and the old configuration will no longer work.
datasets.<dataset>.<rule>: <rule config schema>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Description of proposed changes
This PR contains 2 prep commits + 1 main commit. Message from main commit:
The previous subsampling implementation was fixed to a two-sample recent+background split with some hardcoded parameters. Replacing it with augur subsample allows for more flexible configuration.
To keep the workflow config schema concise, we generate each augur subsample config dynamically using patterns defined in the config helper functions in config.smk.
This is a breaking change and the old configuration will no longer work.
Related issue(s)
Closes #101
Checklist
Update changelogold implementations