@DougManuel DougManuel commented Nov 6, 2025

This PR implements the v0.2 configuration format, with date variables, garbage data support, and restructured documentation.

This PR replaces PR #8, which was deleted when the master branch was renamed to main during GitHub Pages initialization.

Current review priority

The main goal of this PR is to get the code working for use by @karimhalal, @rafdoodle, and @caitlink12.

For this purpose, the DemPoRT example in the vignette is most helpful. Options to review include:

Future review

This PR includes several features that support the creation of mock data and that need discussion and review before they are considered for the wider recodeflow universe. Specifically, how should recodeflow support data types such as dates and integers? At present they fall under the continuous and categorical data types, the only types currently supported.

Issues that arose during the development of this PR concerned the generation of mock:

  • "age" values, which were generated as non-integer numbers (e.g. age = 32.234532 years) although CCHS records age only as integers (e.g. age = 32). An integer data type was therefore added to the mock data configuration file.
  • survival variables, such as the date of disease onset or death. These data may come from CSV files, where dates are stored as character strings, or from other statistical data files, such as .rdata or SAS data files, where dates are already coded as Date or another temporal data type. This necessitated the creation of sourceData and rType, but other potential solutions exist for more generalized support of an expanded range of data types.

Overview

Implements date variable support for DemPoRT v2, along with garbage data generation for validation testing, and major documentation improvements.

Key Changes

New Features:

  • Date variable generation with create_date_var() supporting three distributions (uniform, gompertz, exponential)
  • Survival analysis with create_survival_dates() for cohort studies with temporal ordering
  • Garbage data support via prop_invalid parameter across all variable types for validation testing
  • v0.2 configuration format - expanded for dates and garbage data specifications
  • rType field for proper R type coercion (factor, integer, numeric, Date)
  • Proportion-based generation for all variable types via determine_proportions()

Documentation (Divio Framework):

  • Tutorials: getting-started, tutorial-config-files, tutorial-dates, tutorial-missing-data, tutorial-garbage-data
  • How-to guides: cchs-example, chms-example, demport-example
  • Explanation: dates, advanced-topics
  • Reference: reference-config
  • Updated all vignette frontmatter (authors, callouts, next steps)
  • Standardized heading capitalization (sentence case)
  • Added "recodeflow universe" and Statistics Canada acknowledgements to README

Package Quality:

  • Add renv for reproducible package management (R >= 4.2.0)
  • Update DESCRIPTION: Juan Li as author, recodeflow contributors
  • Add .claude/AI.md for project-specific AI development guidelines
  • Add CONTRIBUTING.md with pkgdown build instructions
  • Improve pkgdown reference page section descriptions

GitHub Actions:

  • Automated pkgdown deployment via GitHub Actions
  • Deploy main branch → root (/)
  • Deploy create-date-var branch → /dev
  • Deploy other branches → /preview/{branch-name}
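
The branch-to-path mapping above can be read as a small routing rule. The sketch below is only an illustration in Python; the real logic lives in the GitHub Actions workflow, and the function name `deploy_path` is hypothetical.

```python
def deploy_path(branch: str) -> str:
    """Return the site path a branch is deployed to, per the list above."""
    if branch == "main":
        return "/"          # production site
    if branch == "create-date-var":
        return "/dev"       # development preview for this PR's branch
    # Every other branch gets its own preview directory.
    return f"/preview/{branch}"


for name in ("main", "create-date-var", "fix-typo"):
    print(name, "->", deploy_path(name))
```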

Bug Fixes:

  • Fixes #5 (Issue with generated values) - Improved 'else' handling in recEnd rules
  • Fixed continuous variable missing codes to generate proper numeric values

Breaking Changes

⚠️ Configuration format updated to v0.2 - requires additional fields:

  • uid - unique identifier for each variable configuration
  • rType - R type specification (factor, integer, numeric, Date)
  • proportion - distribution weights for categorical values
  • Date variables require role containing "date" and rType = "Date"
  • Date variables use variableType = "Continuous" (for recodeflow compatibility)
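
To make the new requirements concrete, here is a minimal sketch of a check over one configuration row. It is written in Python purely for illustration and is not part of the package. It assumes a row is available as a dictionary keyed by field name, treats all three new fields as required on every row, and matches "date" in `role` case-insensitively; the function name `check_v02_row` and the sample values are hypothetical.

```python
REQUIRED_FIELDS = ("uid", "rType", "proportion")
ALLOWED_RTYPES = {"factor", "integer", "numeric", "Date"}


def check_v02_row(row: dict) -> list:
    """Return a list of problems found in one configuration row (empty if none)."""
    problems = []

    # The three fields the v0.2 format adds.
    for field in REQUIRED_FIELDS:
        if row.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")

    r_type = row.get("rType")
    if r_type not in (None, "") and r_type not in ALLOWED_RTYPES:
        problems.append(f"unknown rType: {r_type}")

    # Date rule: role contains "date" -> rType "Date", variableType "Continuous".
    if "date" in str(row.get("role", "")).lower():
        if r_type != "Date":
            problems.append('date variables need rType = "Date"')
        if row.get("variableType") != "Continuous":
            problems.append('date variables need variableType = "Continuous"')

    return problems


# Hypothetical row, for illustration only.
row = {"uid": "v_001", "rType": "Date", "proportion": 1.0,
       "role": "index-date", "variableType": "Continuous"}
print(check_v02_row(row))
```

A check like this could run before generation so that a v0.1 file fails early with a readable message rather than part-way through.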

Preview

Related

All code examples tested and verified in executable vignettes.

reikookamoto and others added 6 commits October 23, 2025 10:42
Implements create_date_var() with distribution options and prop_invalid parameter for all generators.

**Core date functionality:**
- Parse SAS date formats (YYMMDD10., DATE9., etc.) with support for ranges
- Generate Date objects within specified periods
- Three distribution types:
  - uniform: Flat distribution across range (default)
  - gompertz: Gompertz survival distribution (mortality patterns)
  - exponential: Exponential distribution (early events)
- Support for prop_NA (missing dates using NA codes)
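
As a rough illustration of how the three distribution options differ, the sketch below draws dates inside a fixed period by inverse-CDF sampling. It is Python rather than the package's R code, and the parameter names and defaults (`rate`, `shape`, `eta`) and the truncation to the period are assumptions made for the example, not the parameterization used by `create_date_var()`.

```python
import math
import random
from datetime import date, timedelta


def sample_fraction(distribution, rng, rate=3.0, shape=3.0, eta=0.1):
    """Draw a position in [0, 1) along the date range."""
    u = rng.random()
    if distribution == "uniform":
        return u
    if distribution == "exponential":
        # Inverse CDF of an exponential truncated to [0, 1]: mass sits early.
        return -math.log(1.0 - u * (1.0 - math.exp(-rate))) / rate
    if distribution == "gompertz":
        # Gompertz CDF F(x) = 1 - exp(-eta * (exp(shape * x) - 1)),
        # truncated to [0, 1]: the hazard rises, so mass sits later.
        f_max = 1.0 - math.exp(-eta * (math.exp(shape) - 1.0))
        return math.log(1.0 - math.log(1.0 - u * f_max) / eta) / shape
    raise ValueError(f"unknown distribution: {distribution}")


def sample_dates(start, end, n, distribution="uniform", seed=None):
    """Return n dates between start and end (inclusive)."""
    rng = random.Random(seed)
    span_days = (end - start).days
    dates = []
    for _ in range(n):
        frac = sample_fraction(distribution, rng)
        # Clamp guards against a fraction rounding up to exactly 1.0.
        offset = min(span_days, int(frac * (span_days + 1)))
        dates.append(start + timedelta(days=offset))
    return dates


start, end = date(2001, 1, 1), date(2017, 3, 31)
for dist in ("uniform", "gompertz", "exponential"):
    print(dist, sample_dates(start, end, 3, dist, seed=42))
```

With these illustrative defaults, exponential draws cluster near the start of the period and Gompertz draws toward the end, which matches the "early events" and "mortality patterns" descriptions above.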

**prop_invalid parameter (new):**
- Added to create_date_var(), create_cat_var(), create_con_var()
- Generates invalid/out-of-range values for testing validation pipelines
- Date variables: 1-5 years before/after valid range
- Categorical: values outside defined categories
- Continuous: values outside min/max range
- Critical for testing data quality workflows
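
The idea behind `prop_invalid` can be sketched as follows. This is a Python illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, a year is approximated as 365 days, and the number of replaced values is the rounded share of the input length.

```python
import random
from datetime import date, timedelta


def inject_invalid_dates(values, valid_min, valid_max, prop_invalid, seed=None):
    """Return a copy of `values` with a share replaced by out-of-range dates."""
    rng = random.Random(seed)
    out = list(values)
    n_invalid = round(prop_invalid * len(out))
    for i in rng.sample(range(len(out)), n_invalid):
        # Roughly 1 to 5 years outside the valid window (365-day years).
        offset = timedelta(days=rng.randint(365, 5 * 365))
        out[i] = valid_min - offset if rng.random() < 0.5 else valid_max + offset
    return out


def inject_invalid_numbers(values, valid_min, valid_max, prop_invalid, seed=None):
    """Same idea for a continuous variable: push values past min or max."""
    rng = random.Random(seed)
    out = list(values)
    n_invalid = round(prop_invalid * len(out))
    span = (valid_max - valid_min) or 1.0
    for i in rng.sample(range(len(out)), n_invalid):
        overshoot = rng.uniform(0.01, 1.0) * span  # always strictly positive
        out[i] = valid_min - overshoot if rng.random() < 0.5 else valid_max + overshoot
    return out


lo, hi = date(2001, 1, 1), date(2017, 3, 31)
dirty = inject_invalid_dates([date(2010, 6, 15)] * 20, lo, hi, 0.1, seed=1)
print([d for d in dirty if not lo <= d <= hi])
```

A validation pipeline under test should flag exactly the replaced values, so the known share of garbage gives a simple expected count to assert against.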

**Parser enhancements:**
- parse_sas_date_format(): Extract min/max dates from SAS formats
- Support for MDY, YMD, DATE formats with various widths
- Handle period ranges like '01JAN2001'd to '31MAR2017'd
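
For readers unfamiliar with SAS date literals, the sketch below shows one way a period such as `'01JAN2001'd to '31MAR2017'd` could be turned into a pair of dates. It is a Python illustration only and does not reproduce `parse_sas_date_format()`; the function name `parse_sas_period` is hypothetical. Month abbreviations are looked up in a fixed table so the result does not depend on the system locale, which is a design choice of this sketch.

```python
import re
from datetime import date

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

# Matches SAS date literals such as '01JAN2001'd.
_LITERAL = re.compile(r"'(\d{2})([A-Za-z]{3})(\d{4})'[dD]")


def parse_sas_period(text):
    """Extract (start, end) from a string holding exactly two SAS date literals."""
    found = _LITERAL.findall(text)
    if len(found) != 2:
        raise ValueError(f"expected two date literals, found {len(found)}")
    try:
        start, end = (date(int(y), MONTHS[m.upper()], int(d)) for d, m, y in found)
    except KeyError as err:
        raise ValueError(f"unknown month abbreviation: {err}") from None
    if start > end:
        raise ValueError("period start is after period end")
    return start, end


print(parse_sas_period("'01JAN2001'd to '31MAR2017'd"))
```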

**Documentation:**
- New dates.qmd vignette with comprehensive examples
- Distribution comparisons and use cases
- prop_invalid demonstrations for all variable types
- Updated function documentation with @family tags

**Testing:**
- 250 tests passing (61 new tests added)
- Coverage for all distributions
- prop_invalid edge cases
- Date parsing validation

**Vignette improvements:**
- Apply devtools::load_all() pattern to cchs/chms/demport examples
- Consistent with PR #7 vignette standards
…prehensive documentation

This PR implements v0.2 configuration format with date variables, garbage data support, and comprehensive documentation restructuring following the Divio framework.

## Overview
Implements date variable support for DemPoRT v2, along with garbage data generation for validation testing, and major documentation improvements.

## Key Changes

### New Features:
- **Date variable generation** with `create_date_var()` supporting three distributions (uniform, gompertz, exponential)
- **Survival analysis** with `create_survival_dates()` for cohort studies with temporal ordering
- **Garbage data support** via `prop_invalid` parameter across all variable types for validation testing
- **v0.2 configuration format** - expanded for dates and garbage data specifications
- **rType field** for proper R type coercion (factor, integer, numeric, Date)
- **Proportion-based generation** for all variable types via `determine_proportions()`

### Documentation (Divio Framework):
- **Tutorials**: getting-started, tutorial-config-files, tutorial-dates, tutorial-missing-data, tutorial-garbage-data
- **How-to guides**: cchs-example, chms-example, demport-example
- **Explanation**: dates, advanced-topics
- **Reference**: reference-config
- Updated all vignette frontmatter (authors, callouts, next steps)
- Standardized heading capitalization (sentence case)
- Added "recodeflow universe" and Statistics Canada acknowledgements to README
- **Clarified mock vs synthetic data distinctions** with appropriate use cases and limitations

### Package Quality:
- Add renv for reproducible package management (R >= 4.2.0)
- Update DESCRIPTION: Juan Li as author, recodeflow contributors
- Add `.claude/AI.md` for project-specific AI development guidelines
- Add CONTRIBUTING.md with pkgdown build instructions
- Improve pkgdown reference page section descriptions
- **Add Quarto-style callout CSS** for native callout syntax with proper styling in pkgdown

### GitHub Actions:
- Automated pkgdown deployment via GitHub Actions
- Deploy `main` branch → root (/)
- Deploy `create-date-var` branch → /dev
- Deploy other branches → /preview/{branch-name}
- **Fix locale issue** (en_CA → en_US.UTF-8) for Ubuntu compatibility
- **Use r-lib/actions/setup-r-dependencies@v2** for dependency management
- Generate documentation with roxygen2::roxygenize() before build

### Bug Fixes:
- Fixes #5 - Improved 'else' handling in recEnd rules
- Fixed continuous variable missing codes to generate proper numeric values
- Fixed locale issues in date parsing for cross-platform compatibility

### Breaking Changes
⚠️ Configuration format updated to v0.2 - requires additional fields:
- `uid` - unique identifier for each variable configuration
- `rType` - R type specification (factor, integer, numeric, Date)
- `proportion` - distribution weights for categorical values
- Date variables require `role` containing "date" and `rType = "Date"`
- Date variables use `variableType = "Continuous"` (for recodeflow compatibility)

## Files Changed
- R/create_date_var.R - locale fix for cross-platform compatibility
- R/create_survival_dates.R - v0.2 format implementation
- .github/workflows/pkgdown.yaml - comprehensive workflow improvements
- pkgdown/extra.css - NEW: Quarto-style callout styling
- _pkgdown.yml - add CSS include, improve reference sections
- README.md - comprehensive mock data documentation
- vignettes/*.qmd - updated all 11 vignettes
- .claude/AI.md - documented debugging lessons learned

All code examples tested and verified in executable vignettes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@DougManuel

In your README or somewhere else in the repo, add a section or file that describes the organization of the repo, indicating what goes where and why. Specifically, describe the purpose of the scripts folder and what goes there.

rafdoodle commented Nov 8, 2025

In your README or somewhere else in the repo, add a section or file that describes the organization of the repo, indicating what goes where and why. Specifically, describe the purpose of the scripts folder and what goes there.

See this commit

@DougManuel DougManuel closed this Nov 18, 2025
@DougManuel DougManuel deleted the create-date-var branch November 24, 2025 20:28