-
Notifications
You must be signed in to change notification settings - Fork 0
MockData v0.3.0 - Unified API and garbage generation #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
DougManuel
wants to merge
34
commits into
main
Choose a base branch
from
v030-refactor
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+25,033
−2,513
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Repatriate enhanced mockdata architecture from chmsflow PR #7 to create a proper R package that can be used across all recodeflow projects. **Package structure:** - DESCRIPTION: R package metadata - LICENSE: MIT license - README.md: Comprehensive documentation with quick start and examples - .Rbuildignore: Build configuration **Core functions** (from chmsflow/R/): - mockdata-parsers.R: Parse recodeflow-standard metadata formats - parse_variable_start(): Extract raw variable names - parse_range_notation(): Handle range syntax - mockdata-helpers.R: Query and filter metadata - get_cycle_variables(): Filter by cycle - get_raw_variables(): Get unique raw variables - get_variable_categories(): Extract categories - mockdata-generators.R: Generate mock data - create_cat_var(): Categorical variables with tagged NAs - create_con_var(): Continuous variables with distributions **Validation tools** (from chmsflow/mockdata-tools/): - inst/validation/mockdata-tools/: Metadata validation scripts - validate-metadata.R: R CMD check-style validator - test-all-cycles.R: Integration testing - create-comparison.R: Compare approaches - README.md: Detailed documentation **Tests** (from chmsflow/tests/): - tests/testthat/test-mockdata.R: 224 tests covering all functions - tests/testthat.R: Standard testthat entry point **Key features:** - Metadata-driven (uses existing variables.csv/variable-details.csv) - Recodeflow-standard notation support - Works across CHMS, CCHS, and future projects - Well-tested and documented **Roadmap:** - v0.1.0: Categorical/continuous variables (current) - Future: Date variables, quality injection, performance optimization This creates a reusable package that eliminates code duplication across recodeflow projects and provides a single source of truth for mock data generation.
**Testing Setup:**
- Added CHMS test data to inst/testdata/chms/ for package testing
- chms-variables.csv (213 variables)
- chms-variable-details.csv (1,111 details)
- README.md documenting test data usage
- Updated all validation tools to use new test data paths
- Created TESTING_SUMMARY.md documenting all test results
**R CMD CHECK Fixes:**
- Removed vignette builder from DESCRIPTION (no vignettes yet)
- Added stats imports to NAMESPACE (rnorm, runif, sample)
- Wrapped all examples in \dontrun{} to avoid undefined object errors
- Generated roxygen documentation (.Rd files in man/)
- Added .gitignore to exclude build artifacts
**Test Results:**
✅ 160 unit tests passing (test-mockdata.R)
✅ Validation tools passing (validate-metadata.R, test-all-cycles.R)
✅ 99%+ coverage across all 12 CHMS cycles
✅ R CMD check: 2 warnings, 4 notes (minor documentation issues)
**Package Status:**
- Fully functional and tested
- Ready for use in chmsflow and other projects
- Minor documentation polishing can be done post-merge
See TESTING_SUMMARY.md for complete test results and remaining issues.
Removed duplicate versions of create_cat_var.R, create_con_var.R, and util.R that were superseded by the enhanced versions in mockdata-generators.R. The old files were Juan's original simple implementations. The new versions in mockdata-generators.R are enhanced with: - Proper roxygen documentation - Better error handling - Support for more metadata formats - Tagged NA handling - Comprehensive testing Updated example .qmd files to source the new enhanced functions instead. Files removed: - R/create_cat_var.R (69 lines) - R/create_con_var.R (75 lines) - R/util.R (9 lines) Files updated: - Generate_mock_data.qmd - Generate_mock_data_DemPoRT.qmd
Refactor MockData to R package best practices (v0.2.0) Major changes: - Reorganize directory structure to R conventions (inst/extdata/, inst/examples/) - Split mockdata-generators.R into create_cat_var.R and create_con_var.R - Create sample metadata files (~20 variables) with consistent naming - Add package infrastructure (NEWS.md, CONTRIBUTING.md, CODE_OF_CONDUCT.md) - Update DESCRIPTION to v0.2.0 with VignetteBuilder: quarto All 224 tests passing. See NEWS.md for complete changelog. Maintains original logic while extending to support full recodeflow schema notation.
change variables in example
Documentation fixes: - Update roxygen examples in R/create_cat_var.R to use alc_11 (was DHH_SEX, clc_age) - Update roxygen examples in R/create_con_var.R to use alcdwky (was HWTGBMI, DHHAGAGE) - All examples now use variables that exist in sample CHMS data README improvements: - Shorter and clearer. Fixed typos.
Refactor MockData to R package (v0.2.0)
Implements create_date_var() with distribution options and prop_invalid parameter for all generators. **Core date functionality:** - Parse SAS date formats (YYMMDD10., DATE9., etc.) with support for ranges - Generate Date objects within specified periods - Three distribution types: - uniform: Flat distribution across range (default) - gompertz: Gompertz survival distribution (mortality patterns) - exponential: Exponential distribution (early events) - Support for prop_NA (missing dates using NA codes) **prop_invalid parameter (new):** - Added to create_date_var(), create_cat_var(), create_con_var() - Generates invalid/out-of-range values for testing validation pipelines - Date variables: 1-5 years before/after valid range - Categorical: values outside defined categories - Continuous: values outside min/max range - Critical for testing data quality workflows **Parser enhancements:** - parse_sas_date_format(): Extract min/max dates from SAS formats - Support for MDY, YMD, DATE formats with various widths - Handle period ranges like '01JAN2001'd to '31MAR2017'd **Documentation:** - New dates.qmd vignette with comprehensive examples - Distribution comparisons and use cases - prop_invalid demonstrations for all variable types - Updated function documentation with @family tags **Testing:** - 250 tests passing (61 new tests added) - Coverage for all distributions - prop_invalid edge cases - Date parsing validation **Vignette improvements:** - Apply devtools::load_all() pattern to cchs/chms/demport examples - Consistent with PR #7 vignette standards
…prehensive documentation
This PR implements v0.2 configuration format with date variables, garbage data support, and comprehensive documentation restructuring following the Divio framework.
## Overview
Implements date variable support for DemPoRT v2, along with garbage data generation for validation testing, and major documentation improvements.
## Key Changes
### New Features:
- **Date variable generation** with `create_date_var()` supporting three distributions (uniform, gompertz, exponential)
- **Survival analysis** with `create_survival_dates()` for cohort studies with temporal ordering
- **Garbage data support** via `prop_invalid` parameter across all variable types for validation testing
- **v0.2 configuration format** - expanded for dates and garbage data specifications
- **rType field** for proper R type coercion (factor, integer, numeric, Date)
- **Proportion-based generation** for all variable types via `determine_proportions()`
### Documentation (Divio Framework):
- **Tutorials**: getting-started, tutorial-config-files, tutorial-dates, tutorial-missing-data, tutorial-garbage-data
- **How-to guides**: cchs-example, chms-example, demport-example
- **Explanation**: dates, advanced-topics
- **Reference**: reference-config
- Updated all vignette frontmatter (authors, callouts, next steps)
- Standardized heading capitalization (sentence case)
- Added "recodeflow universe" and Statistics Canada acknowledgements to README
- **Clarified mock vs synthetic data distinctions** with appropriate use cases and limitations
### Package Quality:
- Add renv for reproducible package management (R >= 4.2.0)
- Update DESCRIPTION: Juan Li as author, recodeflow contributors
- Add `.claude/AI.md` for project-specific AI development guidelines
- Add CONTRIBUTING.md with pkgdown build instructions
- Improve pkgdown reference page section descriptions
- **Add Quarto-style callout CSS** for native callout syntax with proper styling in pkgdown
### GitHub Actions:
- Automated pkgdown deployment via GitHub Actions
- Deploy `main` branch → root (/)
- Deploy `create-date-var` branch → /dev
- Deploy other branches → /preview/{branch-name}
- **Fix locale issue** (en_CA → en_US.UTF-8) for Ubuntu compatibility
- **Use r-lib/actions/setup-r-dependencies@v2** for dependency management
- Generate documentation with roxygen2::roxygenize() before build
### Bug Fixes:
- Fixes #5 - Improved 'else' handling in recEnd rules
- Fixed continuous variable missing codes to generate proper numeric values
- Fixed locale issues in date parsing for cross-platform compatibility
### Breaking Changes
⚠️ Configuration format updated to v0.2 - requires additional fields:
- `uid` - unique identifier for each variable configuration
- `rType` - R type specification (factor, integer, numeric, Date)
- `proportion` - distribution weights for categorical values
- Date variables require `role` containing "date" and `rType = "Date"`
- Date variables use `variableType = "Continuous"` (for recodeflow compatibility)
## Files Changed
- R/create_date_var.R - locale fix for cross-platform compatibility
- R/create_survival_dates.R - v0.2 format implementation
- .github/workflows/pkgdown.yaml - comprehensive workflow improvements
- pkgdown/extra.css - NEW: Quarto-style callout styling
- _pkgdown.yml - add CSS include, improve reference sections
- README.md - comprehensive mock data documentation
- vignettes/*.qmd - updated all 11 vignettes
- .claude/AI.md - documented debugging lessons learned
All code examples tested and verified in executable vignettes.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
…erage ## Breaking Changes **New function API** - All generator functions now accept full metadata data frames: - create_cat_var(), create_con_var(), create_date_var() now take var + full metadata - Removed pre-filtering requirement - Added databaseStart parameter for internal filtering **Deprecated:** prop_garbage parameter in create_wide_survival_data() - Use garbage parameters in metadata instead via add_garbage() helper ## New Features **Unified garbage generation** across all variable types: - garbage_low_prop + garbage_low_range for values below valid range - garbage_high_prop + garbage_high_range for values above valid range - New add_garbage() helper for easy garbage specification - Categorical garbage now supported **Derived variable identification:** - identify_derived_vars() - identifies derived variables - get_raw_var_dependencies() - extracts dependencies - Compatible with recodeflow patterns ## Bug Fixes - Fixed categorical garbage factor level bug - Fixed recEnd column requirement - now optional - Fixed derived variable generation in create_mock_data() ## Documentation **Restructured using Divio framework:** - Removed 6 vignettes (cchs, chms, demport examples) - Added 2 new tutorials (categorical-continuous, survival-data) - Massively expanded reference-config (2,028 lines) - All vignettes updated to v0.3.0 API - All examples use inst/extdata/minimal-example/ only **Final structure (9 vignettes):** - Tutorials (6): getting-started, tutorial-*, tutorial-garbage-data - How-to guides (1): for-recodeflow-users - Explanation (1): advanced-topics - Reference (1): reference-config ## Test Coverage - 291 passing tests (+72 from 219, +33% improvement) - New test suites: parse_range_notation, read_mock_data_config - Comprehensive garbage generation tests - 0 failures, excellent coverage ## Migration Guide Update function calls: 1. Pass variable name as string (not pre-filtered row) 2. Pass full metadata data frames (not subsets) 3. Add databaseStart parameter 4. Remove manual filtering Update garbage specification: 1. Remove prop_garbage from create_wide_survival_data() 2. Add garbage to metadata using add_garbage() helper 3. Update parameter names (low_prop → garbage_low_prop, etc.)
Resolved all conflicts by keeping v030-refactor (HEAD) versions: Conflict resolution: - .gitignore: kept .tmp/ entry from HEAD - NEWS.md: kept v0.3.0 changelog from HEAD - NAMESPACE, R files: kept v0.3.0 refactored code - README.md: kept v0.3.0 version Deleted files (removed in v0.3.0 refactoring): - inst/extdata/cchs/* (moved to minimal-example) - inst/extdata/demport/* (moved to minimal-example) - inst/metadata/README.md and metadata_registry.yaml - vignettes: cchs-example, chms-example, demport-example Added files from origin/create-date-var: - DOCUMENTATION_FINAL_REVIEW.md - R/create_survival_dates.R (legacy, superseded by create_wide_survival_data) - Legacy metadata and vignette files All v0.3.0 functionality preserved.
These vignettes were removed in v0.3.0 but re-added during merge. They cause build failures and are no longer part of the v0.3.0 documentation.
Remove deleted branches (create-date-var, master) and add v030-refactor as default deploy option. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Shows branch, target, and URL in GitHub Actions summary with guidance on using "." for root deployment. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
rafdoodle
reviewed
Dec 6, 2025
rafdoodle
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler mock data generation API works without errors in getting-started.qmd and tutorial-categorical-continuous.qmd.
Comments on dates:
- In
tutorial-survival-data.qmd,admin_censor_datevaries in value and is not generated to be the same date across individuals, despite specifications ininst/extdata/minimal-example/variable-details.csv. rec_with_table()does not currently work with dates as discussed in DemPoRTv2 PR #16.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR supersedes closed PR#9 and represents a major refactoring with breaking changes to the function API.
For reviewers (@rafdoodle , @karimhalal )
This release simplifies creating mock data for your projects with: Simpler function calls.
Unified garbage generation - Add invalid data to any variable type:
(but a better approach is to use
garbage_**headers invariables.csvandvariable_details.csv)Better documentation
variables.csvandvariable_details.csv.Breaking changes
⚠️ All generator functions have new signatures - See NEWS.md for migration guide.
Key changes:
Functions now accept full metadata data frames (not pre-filtered subsets)
New required parameters: var, databaseStart
Unified garbage API: garbage_low_prop/range, garbage_high_prop/range
Deprecated: prop_garbage in create_wide_survival_data()
What's carried forward from PR#9
✅ Date variable generation (create_date_var())
✅ Survival data support (create_wide_survival_data())
✅ Garbage data for validation testing (now unified across all types)
✅ Type coercion with rType field
Testing
291 passing tests (+33% from v0.2.2)
Complete test coverage for new API
All vignettes render successfully with pkgdown::build_site()
Preview
Try the examples from getting-started or tutorial-categorical-continuous to see the new API in action.