Update the benchmark to launch one instance per (dataset, synthesizer) pair#611
Update the benchmark to launch one instance per (dataset, synthesizer) pair#611R-Palazzo wants to merge 7 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## issue-604-2-private-bucket #611 +/- ##
==============================================================
- Coverage 85.89% 85.79% -0.11%
==============================================================
Files 40 40
Lines 3750 3723 -27
==============================================================
- Hits 3221 3194 -27
Misses 529 529
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
fealho
left a comment
There was a problem hiding this comment.
This runs the UniformSynthesizer in every instance. So for example, for the 81 single table runs, it ran 81 times. Is this intended?
Also, do we want to change this? I don't think it affects the monthly run, because the config returns before reaching that line, but if you pass datasets/synthesizers, it still creates only 1 instance.
And could you double check the CLI path still works? I think maybe it depended on the yaml files you deleted. Something like this:
python sdgym/_benchmark_launcher/script.py --modality single_table --output-destination s3://bucket/test --synthesizers CTGANSynthesizer
amontanez24
left a comment
There was a problem hiding this comment.
This looks good. I just want to confirm something. Does the instance that runs the workflow to launch the jobs download all the datasets for all jobs, or only the one for the job it launches?
I understand we are doing one job = 1 synthesizer and dataset, but I want to make sure the instance that launches the job isn't downloading too much
ea2f9bc to
c378d46
Compare
95be93a to
9a91033
Compare
ba22fa7 to
fdb0a93
Compare
@amontanez24, yes, that was the case. The instance that runs the workflow was downloading all the datasets. I tried adding the https://github.com/sdv-dev/SDGym/actions/runs/26508750022/job/78067938428 To avoid this in this PR I moved the dataset loading during execution so it's only the gcp instances that will load the data. I'm running some test and check that it works end-to-end with the |
725c22c to
5555441
Compare
19c7c43 to
5cb84b0
Compare
Hi @fealho thanks for the review!
python sdgym/_benchmark_launcher/script.py --modality single_table --output-destination s3://sdgym-benchmark/Debug/Issue_605_CLI --synthesizers GaussianCopulaSynthesizer |
Resolve #605
86b9zz525