API Reference

This document provides an overview of all modules and public functions in the statcpp library (524 public functions across 31 header files).

For detailed function specifications, formulas, and usage examples, please refer to the HTML documentation generated by Doxygen.

Generating Documentation

To generate detailed API documentation:

# Install Doxygen (if necessary)
# macOS
brew install doxygen

# Ubuntu/Debian
sudo apt-get install doxygen

# Generate documentation
./generate_docs.sh

# Or run directly
doxygen Doxyfile

# Open in browser (macOS)
open doc/html/index.html

# Linux
xdg-open doc/html/index.html

Module List

1. Basic Statistics (basic_statistics.hpp)

Functions for computing basic statistical measures.

Functions:

| Function | Description | Overloads |
|---|---|---|
| sum | Sum of elements | + projection |
| count | Number of elements | |
| mean | Arithmetic mean | + projection |
| median | Median (requires sorted range) | + projection |
| mode | Mode (most frequent value) | + projection |
| modes | All modes | + projection |
| geometric_mean | Geometric mean | + projection |
| harmonic_mean | Harmonic mean | + projection |
| trimmed_mean | Trimmed mean (requires sorted range) | + projection |
| weighted_mean | Weighted mean | + projection |
| logarithmic_mean | Logarithmic mean of two values | |
| weighted_harmonic_mean | Weighted harmonic mean | + projection |
| argmin | Index of minimum element | + projection |
| argmax | Index of maximum element | + projection |

2. Dispersion & Spread (dispersion_spread.hpp)

Functions for measuring data dispersion (variability).

Functions:

| Function | Description | Overloads |
|---|---|---|
| range | Range (max - min) | + projection |
| var | Variance with ddof parameter | + precomputed mean, + projection |
| population_variance | Population variance | + precomputed mean, + projection |
| sample_variance | Sample variance | + precomputed mean, + projection |
| variance | Variance (defaults to sample) | + precomputed mean, + projection |
| stdev | Standard deviation with ddof parameter | + precomputed mean, + projection |
| population_stddev | Population standard deviation | + precomputed mean, + projection |
| sample_stddev | Sample standard deviation | + precomputed mean, + projection |
| stddev | Standard deviation (defaults to sample) | + precomputed mean, + projection |
| coefficient_of_variation | Coefficient of variation | + precomputed mean, + projection |
| iqr | Interquartile range (requires sorted range) | + projection |
| mean_absolute_deviation | Mean absolute deviation | + precomputed mean, + projection |
| weighted_variance | Weighted variance | + projection |
| weighted_stddev | Weighted standard deviation | + projection |
| geometric_stddev | Geometric standard deviation | + projection |
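
The ddof ("delta degrees of freedom") parameter of var and stdev can be illustrated with a small standalone sketch. It assumes ddof has its usual meaning, a divisor of n − ddof, so that ddof = 0 gives the population variance and ddof = 1 the sample variance. The name variance_ddof and its signature are hypothetical and are not statcpp's API; check the Doxygen output for the actual signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of statcpp): variance with divisor n - ddof.
template <typename Iterator>
double variance_ddof(Iterator first, Iterator last, std::size_t ddof) {
    const auto n = static_cast<std::size_t>(std::distance(first, last));
    assert(n > ddof && "need more observations than ddof");
    const double mean = std::accumulate(first, last, 0.0) / static_cast<double>(n);
    double ss = 0.0;  // sum of squared deviations from the mean
    for (Iterator it = first; it != last; ++it) {
        const double d = static_cast<double>(*it) - mean;
        ss += d * d;
    }
    return ss / static_cast<double>(n - ddof);
}
```

For the data {2, 4, 4, 4, 5, 5, 7, 9} the sum of squared deviations is 32, so this sketch yields 32/8 with ddof = 0 and 32/7 with ddof = 1.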

3. Order Statistics (order_statistics.hpp)

Functions for computing order statistics (all require a sorted range).

Structs:

  • quartile_result — fields: q1, q2, q3
  • five_number_summary_result — fields: min, q1, median, q3, max

Functions:

| Function | Description | Overloads |
|---|---|---|
| interpolate_at | Interpolate value at fractional index | + projection |
| minimum | Minimum value | + projection |
| maximum | Maximum value | + projection |
| quartiles | Quartiles (Q1, Q2, Q3) | + projection |
| percentile | Percentile at given proportion | + projection |
| five_number_summary | Five-number summary | + projection |
| weighted_median | Weighted median | + projection |
| weighted_percentile | Weighted percentile | + projection |
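
To show what "interpolate at a fractional index" and "percentile at a given proportion" can mean on a sorted range, here is a minimal standalone sketch. It assumes linear interpolation at position p · (n − 1), one common convention; statcpp may use a different one, so treat the helper names and the position rule as assumptions and confirm against the Doxygen documentation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: value at a fractional index of a sorted vector,
// linearly interpolated between the two neighbouring elements.
double interpolate_at_sketch(const std::vector<double>& sorted, double index) {
    assert(!sorted.empty());
    assert(index >= 0.0 && index <= static_cast<double>(sorted.size() - 1));
    const auto lo = static_cast<std::size_t>(std::floor(index));
    const std::size_t hi = (lo + 1 < sorted.size()) ? lo + 1 : lo;
    const double frac = index - static_cast<double>(lo);
    return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
}

// Hypothetical helper: percentile for a proportion p in [0, 1],
// assuming the position p * (n - 1) convention.
double percentile_sketch(const std::vector<double>& sorted, double p) {
    assert(p >= 0.0 && p <= 1.0);
    return interpolate_at_sketch(sorted, p * static_cast<double>(sorted.size() - 1));
}
```

Under this convention the median of the sorted values {1, 2, 3, 4} falls at index 1.5, halfway between 2 and 3.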

4. Shape of Distribution (shape_of_distribution.hpp)

Statistics characterizing distribution shape.

Functions:

| Function | Description | Overloads |
|---|---|---|
| population_skewness | Population skewness | + precomputed mean, + projection |
| sample_skewness | Sample skewness | + precomputed mean, + projection |
| skewness | Skewness (defaults to sample) | + precomputed mean, + projection |
| population_kurtosis | Population kurtosis | + precomputed mean, + projection |
| sample_kurtosis | Sample kurtosis | + precomputed mean, + projection |
| kurtosis | Kurtosis (defaults to sample) | + precomputed mean, + projection |

5. Correlation & Covariance (correlation_covariance.hpp)

Functions for measuring relationships between two variables.

Functions:

| Function | Description | Overloads |
|---|---|---|
| population_covariance | Population covariance | + precomputed means, + projection |
| sample_covariance | Sample covariance | + precomputed means, + projection |
| covariance | Covariance (defaults to sample) | + precomputed means, + projection |
| pearson_correlation | Pearson correlation coefficient | + precomputed means, + projection |
| spearman_correlation | Spearman rank correlation | + projection |
| kendall_tau | Kendall's rank correlation | + projection |
| weighted_covariance | Weighted covariance | + projection |

Note: weighted_covariance assumes frequency weights (repeat counts). The Bessel correction W / (W² − Σwᵢ²), where W = Σwᵢ, is designed for this weight type. When all weights equal 1 it evaluates to 1/(n−1), which is a factor of n/(n−1) relative to the population (divide-by-n) estimate. Precision weights (inverse-variance) and reliability weights require a different correction.
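
The arithmetic in this note can be checked with a short standalone sketch. It assumes the correction factor multiplies the weighted sum of cross-products taken about the weighted means; that is an assumption about how the formula is applied, not a copy of statcpp's implementation, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not statcpp's code): weighted covariance using the
// correction factor W / (W^2 - sum(w_i^2)).
double weighted_covariance_sketch(const std::vector<double>& x,
                                  const std::vector<double>& y,
                                  const std::vector<double>& w) {
    assert(x.size() == y.size() && x.size() == w.size());
    double W = 0.0, W2 = 0.0, sx = 0.0, sy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        W += w[i];
        W2 += w[i] * w[i];
        sx += w[i] * x[i];
        sy += w[i] * y[i];
    }
    assert(W * W > W2 && "correction factor undefined");
    const double mx = sx / W;  // weighted mean of x
    const double my = sy / W;  // weighted mean of y
    double sxy = 0.0;          // weighted sum of cross-products
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxy += w[i] * (x[i] - mx) * (y[i] - my);
    }
    return sxy * W / (W * W - W2);
}
```

With x = {1, 2, 3, 4}, y = {2, 4, 6, 8} and unit weights, the sum of cross-products is 10 and the factor is 4 / (16 − 4) = 1/3, so the result is 10/3, the same as the ordinary sample covariance.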

6. Frequency Distribution (frequency_distribution.hpp)

Creating frequency distributions and histograms.

Structs:

  • frequency_entry<T> — fields: value, count, relative_frequency, cumulative_count, cumulative_relative_frequency
  • frequency_table_result<T> — fields: entries, total_count

Functions:

| Function | Description | Overloads |
|---|---|---|
| frequency_table | Frequency distribution table | + projection |
| frequency_count | Frequency count map | + projection |
| relative_frequency | Relative frequency map | + projection |
| cumulative_frequency | Cumulative frequency | + projection |
| cumulative_relative_frequency | Cumulative relative frequency | + projection |

7. Special Functions (special_functions.hpp)

Special functions used in statistical computations.

Constants:

  • pi, sqrt_2, sqrt_2_pi, log_sqrt_2_pi

Functions:

| Function | Description |
|---|---|
| lgamma | Log-gamma function |
| tgamma | Gamma function |
| beta | Beta function |
| lbeta | Log-beta function |
| betainc | Regularized incomplete beta function |
| betaincinv | Inverse of regularized incomplete beta function |
| erf | Error function |
| erfc | Complementary error function |
| norm_cdf | Standard normal CDF |
| norm_quantile | Standard normal quantile (inverse CDF) |
| gammainc_lower | Regularized lower incomplete gamma function |
| gammainc_upper | Regularized upper incomplete gamma function |
| gammainc_lower_inv | Inverse of regularized lower incomplete gamma function |

8. Random Engine (random_engine.hpp)

Random number generation engine.

Type Aliases:

  • default_random_engine = std::mt19937_64

Traits:

  • is_random_engine<T> — type trait for random engine detection
  • is_random_engine_v<T> — variable template shortcut

Functions:

| Function | Description |
|---|---|
| get_random_engine | Get thread-local default random engine |
| set_seed | Set random seed |
| randomize_seed | Randomize seed from hardware entropy |

9. Continuous Distributions (continuous_distributions.hpp)

PDF, CDF, quantile functions, and random number generation for continuous probability distributions.

Each distribution provides _pdf, _cdf, _quantile, and _rand functions. The exception is the studentized range, which provides only CDF and quantile functions. The header also includes the Cauchy distribution, which is not listed in the table below.

Functions:

| Distribution | pdf | cdf | quantile | rand |
|---|---|---|---|---|
| Uniform | uniform_pdf | uniform_cdf | uniform_quantile | uniform_rand |
| Normal | normal_pdf | normal_cdf | normal_quantile | normal_rand |
| Exponential | exponential_pdf | exponential_cdf | exponential_quantile | exponential_rand |
| Gamma | gamma_pdf | gamma_cdf | gamma_quantile | gamma_rand |
| Beta | beta_pdf | beta_cdf | beta_quantile | beta_rand |
| Chi-squared | chisq_pdf | chisq_cdf | chisq_quantile | chisq_rand |
| t | t_pdf | t_cdf | t_quantile | t_rand |
| F | f_pdf | f_cdf | f_quantile | f_rand |
| Log-normal | lognormal_pdf | lognormal_cdf | lognormal_quantile | lognormal_rand |
| Weibull | weibull_pdf | weibull_cdf | weibull_quantile | weibull_rand |
| Studentized range | (none) | studentized_range_cdf | studentized_range_quantile | (none) |

10. Discrete Distributions (discrete_distributions.hpp)

PMF, CDF, quantile functions, and random number generation for discrete probability distributions.

Each distribution provides _pmf, _cdf, _quantile, and _rand functions.

Utility Functions:

| Function | Description |
|---|---|
| log_factorial | Log of factorial |
| log_binomial_coef | Log of binomial coefficient |
| binomial_coef | Binomial coefficient |

Distribution Functions:

| Distribution | pmf | cdf | quantile | rand |
|---|---|---|---|---|
| Binomial | binomial_pmf | binomial_cdf | binomial_quantile | binomial_rand |
| Poisson | poisson_pmf | poisson_cdf | poisson_quantile | poisson_rand |
| Geometric | geometric_pmf | geometric_cdf | geometric_quantile | geometric_rand |
| Hypergeometric | hypergeom_pmf | hypergeom_cdf | hypergeom_quantile | hypergeom_rand |
| Negative binomial | nbinom_pmf | nbinom_cdf | nbinom_quantile | nbinom_rand |
| Bernoulli | bernoulli_pmf | bernoulli_cdf | bernoulli_quantile | bernoulli_rand |
| Discrete uniform | discrete_uniform_pmf | discrete_uniform_cdf | discrete_uniform_quantile | discrete_uniform_rand |

11. Estimation (estimation.hpp)

Statistical estimation (confidence intervals, sample size calculation).

Structs:

  • confidence_interval — fields: lower, upper, point_estimate, confidence_level

Functions:

| Function | Description | Overloads |
|---|---|---|
| standard_error | Standard error of the mean | + precomputed stddev, + projection |
| ci_mean | Confidence interval for mean (t-distribution) | 2 overloads |
| ci_mean_z | Confidence interval for mean (z, known variance) | |
| ci_proportion | CI for proportion (Wald method) | |
| ci_proportion_wilson | CI for proportion (Wilson method) | |
| ci_variance | CI for variance (chi-squared based) | |
| ci_mean_diff | CI for difference of means | |
| ci_mean_diff_welch | CI for difference of means (Welch) | |
| ci_mean_diff_pooled | CI for difference of means (pooled) | |
| ci_proportion_diff | CI for difference of proportions | |
| margin_of_error_mean | Margin of error for mean | 2 overloads |
| margin_of_error_proportion | Margin of error for proportion | |
| margin_of_error_proportion_worst_case | Worst-case margin of error | |
| sample_size_for_moe_proportion | Sample size for desired MOE (proportion) | |
| sample_size_for_moe_mean | Sample size for desired MOE (mean) | |

12. Parametric Tests (parametric_tests.hpp)

Parametric hypothesis tests.

Enums:

  • alternative_hypothesis — values: two_sided, less, greater

Structs:

  • test_result — fields: statistic, p_value, df, alternative, df2

Functions:

| Function | Description |
|---|---|
| z_test | One-sample z-test (known variance) |
| z_test_proportion | One-sample proportion z-test |
| z_test_proportion_two_sample | Two-sample proportion z-test |
| t_test | One-sample t-test |
| t_test_two_sample | Two-sample t-test (pooled variance) |
| t_test_welch | Two-sample t-test (Welch method) |
| t_test_paired | Paired t-test |
| chisq_test_gof | Chi-squared goodness-of-fit test |
| chisq_test_gof_uniform | Chi-squared GOF (uniform expected) |
| chisq_test_independence | Chi-squared test of independence |
| f_test | F-test for equality of variances |
| bonferroni_correction | Bonferroni p-value correction |
| benjamini_hochberg_correction | Benjamini-Hochberg FDR correction |
| holm_correction | Holm step-down correction |
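
For readers unfamiliar with the multiple-comparison corrections listed above, the following standalone sketch shows the textbook Bonferroni and Holm step-down adjustments as adjusted p-values. The function names, the vector-in/vector-out interface, and the choice to return adjusted p-values are assumptions made for illustration; statcpp's functions may differ in signature and return type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch: Bonferroni-adjusted p-values, min(1, m * p).
std::vector<double> bonferroni_sketch(const std::vector<double>& p) {
    const double m = static_cast<double>(p.size());
    std::vector<double> adj(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) {
        assert(p[i] >= 0.0 && p[i] <= 1.0);
        adj[i] = std::min(1.0, m * p[i]);
    }
    return adj;
}

// Hypothetical sketch: Holm step-down adjusted p-values, returned in the
// same order as the input.
std::vector<double> holm_sketch(const std::vector<double>& p) {
    const std::size_t m = p.size();
    std::vector<std::size_t> order(m);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(),
              [&p](std::size_t a, std::size_t b) { return p[a] < p[b]; });
    std::vector<double> adj(m);
    double running_max = 0.0;  // keeps adjusted p-values non-decreasing
    for (std::size_t rank = 0; rank < m; ++rank) {
        const double scaled = static_cast<double>(m - rank) * p[order[rank]];
        running_max = std::max(running_max, std::min(1.0, scaled));
        adj[order[rank]] = running_max;
    }
    return adj;
}
```

For p-values {0.01, 0.04, 0.03}, Bonferroni multiplies each by 3, while Holm multiplies the smallest by 3, the next by 2, and the largest by 1, then enforces monotonicity, giving {0.03, 0.06, 0.06}.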

13. Nonparametric Tests (nonparametric_tests.hpp)

Nonparametric hypothesis tests.

Functions:

| Function | Description |
|---|---|
| compute_ranks_with_ties | Compute ranks with tie handling |
| compute_tie_groups | Compute tie group information |
| shapiro_wilk_test | Shapiro-Wilk normality test |
| lilliefors_test | Lilliefors normality test |
| ks_test_normal | (deprecated; use lilliefors_test) |
| levene_test | Levene's test for homogeneity of variance |
| bartlett_test | Bartlett's test for homogeneity of variance |
| wilcoxon_signed_rank_test | Wilcoxon signed-rank test |
| mann_whitney_u_test | Mann-Whitney U test |
| kruskal_wallis_test | Kruskal-Wallis test |
| fisher_exact_test | Fisher's exact test (2x2 table) |

Note (tie detection): Rank-based functions (compute_ranks_with_ties, wilcoxon_signed_rank_test, mann_whitney_u_test, kruskal_wallis_test, spearman_correlation, kendall_tau) use exact floating-point equality (==) for tie detection. This is appropriate for observed data (integers, fixed-precision decimals) and consistent with R's behavior. If input values are the result of floating-point arithmetic, they may not be recognized as ties. Round or quantize such data before passing it to these functions.
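
The effect described in this note is easy to reproduce. The sketch below shows one possible way to quantize computed values before ranking; the helper name quantize and the choice of rounding to a fixed number of decimals are illustrative assumptions, not part of statcpp.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: round each value to a fixed number of decimal places
// so that values equal up to rounding error compare equal with ==.
std::vector<double> quantize(const std::vector<double>& values, int decimals) {
    assert(decimals >= 0);
    const double scale = std::pow(10.0, decimals);
    std::vector<double> out;
    out.reserve(values.size());
    for (double v : values) {
        out.push_back(std::round(v * scale) / scale);
    }
    return out;
}
```

For example, 0.1 + 0.2 and 0.3 differ in their last bit and would not be treated as a tie, but after quantize({0.1 + 0.2, 0.3}, 9) the two entries compare equal. Choose the number of decimals from the precision of the measurement, not from the precision of double.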

Note (Lilliefors test): lilliefors_test uses an asymptotic approximation for p-value calculation, which may be imprecise for small samples (n < 20) or in extreme tail regions. For small samples, consider using shapiro_wilk_test as an alternative.

14. Effect Size (effect_size.hpp)

Effect size calculations.

Enums:

  • effect_size_magnitude — values: negligible, small, medium, large

Functions:

| Function | Description |
|---|---|
| cohens_d | Cohen's d (one-sample, 3 overloads) |
| cohens_d_two_sample | Cohen's d (two-sample, pooled SD) |
| hedges_correction_factor | Hedges' g bias correction factor |
| hedges_g | Hedges' g (one-sample, bias-corrected) |
| hedges_g_two_sample | Hedges' g (two-sample, bias-corrected) |
| glass_delta | Glass's delta (control group SD) |
| t_to_r | Convert t-value to correlation |
| d_to_r | Convert Cohen's d to correlation |
| r_to_d | Convert correlation to Cohen's d |
| eta_squared | Eta squared from F-test |
| partial_eta_squared | Partial eta squared |
| omega_squared | Omega squared (less biased) |
| cohens_h | Cohen's h for proportions |
| odds_ratio | Odds ratio (2x2 table) |
| risk_ratio | Risk ratio (2x2 table) |
| interpret_cohens_d | Interpret Cohen's d magnitude |
| interpret_correlation | Interpret correlation magnitude |
| interpret_eta_squared | Interpret eta squared magnitude |
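
As a reminder of what the two-sample effect sizes measure, here is a standalone sketch of Cohen's d with a pooled standard deviation and of the widely used approximation 1 − 3 / (4·df − 1) to the Hedges correction factor. Everything in the sketch namespace is hypothetical; statcpp may compute the correction exactly (via the gamma function) rather than with this approximation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

namespace sketch {

double mean(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s / static_cast<double>(v.size());
}

// Sum of squared deviations from the mean.
double sum_sq_dev(const std::vector<double>& v) {
    const double m = mean(v);
    double ss = 0.0;
    for (double x : v) ss += (x - m) * (x - m);
    return ss;
}

// Cohen's d for two independent samples, using the pooled standard deviation.
double cohens_d_two_sample(const std::vector<double>& a,
                           const std::vector<double>& b) {
    assert(a.size() >= 2 && b.size() >= 2);
    const double df = static_cast<double>(a.size() + b.size() - 2);
    const double pooled_sd = std::sqrt((sum_sq_dev(a) + sum_sq_dev(b)) / df);
    return (mean(a) - mean(b)) / pooled_sd;
}

// Common approximation to the Hedges small-sample correction factor.
double hedges_correction_approx(double df) {
    assert(df > 1.0);
    return 1.0 - 3.0 / (4.0 * df - 1.0);
}

}  // namespace sketch
```

For a = {2, 4, 6} and b = {1, 3, 5} the pooled standard deviation is 2, so d = 0.5; with df = 4 the approximate correction is 0.8, and multiplying the two gives a bias-corrected value of 0.4.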

15. Resampling (resampling.hpp)

Resampling methods.

Structs:

  • bootstrap_result — fields: estimate, standard_error, ci_lower, ci_upper, bias, replicates
  • permutation_result — fields: observed_statistic, p_value, n_permutations, permutation_distribution

Functions:

| Function | Description |
|---|---|
| bootstrap_sample | Generate a bootstrap sample (2 overloads) |
| bootstrap | Bootstrap estimation with custom statistic |
| bootstrap_mean | Bootstrap for mean |
| bootstrap_median | Bootstrap for median |
| bootstrap_stddev | Bootstrap for standard deviation |
| bootstrap_bca | BCa bootstrap confidence interval |
| permutation_test_two_sample | Two-sample permutation test |
| permutation_test_paired | Paired permutation test |
| permutation_test_correlation | Correlation permutation test |

16. Power Analysis (power_analysis.hpp)

Power analysis and sample size calculation.

Structs:

  • power_result — fields: power, sample_size, effect_size, alpha

Functions (each has string and alternative_hypothesis enum overloads):

| Function | Description |
|---|---|
| power_t_test_one_sample | Power for one-sample t-test |
| sample_size_t_test_one_sample | Sample size for one-sample t-test |
| power_t_test_two_sample | Power for two-sample t-test |
| sample_size_t_test_two_sample | Sample size for two-sample t-test |
| power_prop_test | Power for proportion test |
| sample_size_prop_test | Sample size for proportion test |
| power_analysis_t_one_sample | Power analysis returning power_result |
| power_analysis_t_one_sample_n | Sample size analysis returning power_result |

Note: The t-test power/sample size functions use a normal distribution approximation. For large samples (n > 30), accuracy is sufficient. For small samples or small effect sizes, the results may slightly overestimate power compared to the exact noncentral t-distribution. For more precise calculations, consider specialized software such as R's pwr package or G*Power.
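
To make the normal approximation concrete, the sketch below computes the approximate power of a two-sided one-sample test as Φ(d√n − z) + Φ(−d√n − z), where z is the upper α/2 critical value. This is the textbook formula, written as a standalone illustration; the function names are hypothetical, and whether statcpp includes the second (opposite-tail) term is not stated in this document.

```cpp
#include <cassert>
#include <cmath>

// Standard normal CDF via the complementary error function.
double norm_cdf_sketch(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Standard normal quantile by bisection (slow but simple; for illustration).
double norm_quantile_sketch(double p) {
    assert(p > 0.0 && p < 1.0);
    double lo = -10.0, hi = 10.0;
    for (int i = 0; i < 200; ++i) {
        const double mid = 0.5 * (lo + hi);
        if (norm_cdf_sketch(mid) < p) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return 0.5 * (lo + hi);
}

// Normal-approximation power of a two-sided one-sample test for a
// standardized effect size d, sample size n, and significance level alpha.
double power_one_sample_normal_approx(double d, double n, double alpha) {
    assert(n > 0.0 && alpha > 0.0 && alpha < 1.0);
    const double z_crit = norm_quantile_sketch(1.0 - alpha / 2.0);
    const double shift = d * std::sqrt(n);  // mean of the test statistic under H1
    return norm_cdf_sketch(shift - z_crit) + norm_cdf_sketch(-shift - z_crit);
}
```

With d = 0 the formula returns α, as it should, and with d = 0.5, n = 32, α = 0.05 it gives a power of roughly 0.81. The exact noncentral-t calculation gives a somewhat lower value for small n, which is the overestimate the note refers to.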

17. Linear Regression (linear_regression.hpp)

Linear regression analysis.

Structs:

  • simple_regression_result — fields: intercept, slope, intercept_se, slope_se, intercept_t, slope_t, intercept_p, slope_p, r_squared, adj_r_squared, residual_se, f_statistic, f_p_value, df_regression, df_residual, ss_total, ss_regression, ss_residual
  • multiple_regression_result — fields: coefficients, coefficient_se, t_statistics, p_values, r_squared, adj_r_squared, residual_se, f_statistic, f_p_value, df_regression, df_residual, ss_total, ss_regression, ss_residual
  • prediction_interval — fields: prediction, lower, upper, se_prediction
  • residual_diagnostics — fields: residuals, standardized_residuals, studentized_residuals, hat_values, cooks_distance, durbin_watson

Functions:

| Function | Description |
|---|---|
| simple_linear_regression | Simple linear regression |
| multiple_linear_regression | Multiple linear regression |
| predict | Prediction (2 overloads: simple, multiple) |
| prediction_interval_simple | Prediction interval for simple regression |
| confidence_interval_mean | CI for mean response in simple regression |
| compute_residual_diagnostics | Residual diagnostics (2 overloads) |
| compute_vif | Variance inflation factor |
| correlation_matrix_determinant | Determinant of correlation matrix |
| multicollinearity_score | Multicollinearity assessment score |
| r_squared | Coefficient of determination |
| adjusted_r_squared | Adjusted R-squared |

18. ANOVA (anova.hpp)

Analysis of variance.

Structs:

  • anova_row — fields: source, ss, df, ms, f_statistic, p_value
  • one_way_anova_result — fields: between, within, ss_total, df_total, n_groups, n_total, grand_mean, group_means, group_sizes
  • two_way_anova_result — fields: factor_a, factor_b, interaction, error, ss_total, df_total, levels_a, levels_b, n_total, grand_mean
  • posthoc_comparison — fields: group1, group2, mean_diff, se, statistic, p_value, lower, upper, significant
  • posthoc_result — fields: method, comparisons, alpha, mse, df_error
  • ancova_result — fields: ss_covariate, ss_treatment, ss_error, df_covariate, df_treatment, df_error, ms_covariate, ms_treatment, ms_error, f_covariate, f_treatment, p_covariate, p_treatment, adjusted_means

Functions:

| Function | Description |
|---|---|
| one_way_anova | One-way ANOVA |
| two_way_anova | Two-way ANOVA |
| tukey_hsd | Tukey HSD post-hoc test |
| bonferroni_posthoc | Bonferroni post-hoc test |
| dunnett_posthoc | Dunnett post-hoc test |
| scheffe_posthoc | Scheffe post-hoc test |
| one_way_ancova | One-way ANCOVA |
| eta_squared | Eta squared from ANOVA result |
| partial_eta_squared_a | Partial eta squared for factor A |
| partial_eta_squared_b | Partial eta squared for factor B |
| partial_eta_squared_interaction | Partial eta squared for interaction |
| omega_squared | Omega squared from ANOVA result |
| cohens_f | Cohen's f from ANOVA result |

19. GLM (glm.hpp)

Generalized linear models.

Enums:

  • link_function — values: identity, logit, probit, log, inverse, cloglog
  • distribution_family — values: gaussian, binomial, poisson, gamma_family

Structs:

  • glm_result — fields: coefficients, coefficient_se, z_statistics, p_values, null_deviance, residual_deviance, df_null, df_residual, aic, bic, log_likelihood, iterations, converged, link, family
  • glm_residuals — fields: response, pearson, deviance, working

Functions:

| Function | Description |
|---|---|
| glm_fit | General GLM fitting (IRLS algorithm) |
| logistic_regression | Logistic regression (binomial/logit) |
| predict_probability | Predict probabilities from GLM |
| odds_ratios | Odds ratios from logistic regression |
| odds_ratios_ci | Odds ratios with confidence interval |
| poisson_regression | Poisson regression (poisson/log) |
| predict_count | Predict counts from Poisson model |
| incidence_rate_ratios | IRR from Poisson regression |
| compute_glm_residuals | GLM residual analysis |
| overdispersion_test | Test for overdispersion |
| pseudo_r_squared_mcfadden | McFadden pseudo R-squared |
| pseudo_r_squared_nagelkerke | Nagelkerke pseudo R-squared |

20. Model Selection (model_selection.hpp)

Model selection and regularized regression.

Structs:

  • cv_result — fields: mean_error, se_error, fold_errors, n_folds
  • regularized_regression_result — fields: coefficients, lambda, mse, iterations, converged

Functions:

| Function | Description |
|---|---|
| aic | Akaike information criterion |
| aic_linear | AIC for linear regression (2 overloads) |
| aicc | Corrected AIC |
| bic | Bayesian information criterion |
| bic_linear | BIC for linear regression (2 overloads) |
| press_statistic | PRESS statistic |
| create_cv_folds | Create cross-validation folds |
| cross_validate_linear | Cross-validate linear regression |
| loocv_linear | Leave-one-out cross-validation |
| ridge_regression | Ridge regression |
| lasso_regression | Lasso regression |
| elastic_net_regression | Elastic net regression |
| cv_ridge | Cross-validated ridge regression |
| cv_lasso | Cross-validated lasso regression |
| generate_lambda_grid | Generate regularization parameter grid |

21. Distance & Similarity Metrics (distance_metrics.hpp)

Distance and similarity calculations.

Functions:

| Function | Description | Overloads |
|---|---|---|
| euclidean_distance | Euclidean distance | iterator, vector |
| manhattan_distance | Manhattan distance | iterator, vector |
| cosine_similarity | Cosine similarity | iterator, vector |
| cosine_distance | Cosine distance (1 - similarity) | iterator, vector |
| mahalanobis_distance | Mahalanobis distance | |
| minkowski_distance | Minkowski distance | iterator, vector |
| chebyshev_distance | Chebyshev distance | iterator, vector |
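
Several of these metrics are related: the Minkowski distance of order p reduces to the Manhattan distance at p = 1 and the Euclidean distance at p = 2, and approaches the Chebyshev distance as p grows. A minimal standalone sketch of that relationship, with hypothetical function names that are not statcpp's API:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: Minkowski distance of order p >= 1.
double minkowski_sketch(const std::vector<double>& a,
                        const std::vector<double>& b, double p) {
    assert(a.size() == b.size());
    assert(p >= 1.0);
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        acc += std::pow(std::fabs(a[i] - b[i]), p);
    }
    return std::pow(acc, 1.0 / p);
}

// Hypothetical sketch: Chebyshev distance, the largest coordinate difference.
double chebyshev_sketch(const std::vector<double>& a,
                        const std::vector<double>& b) {
    assert(a.size() == b.size());
    double best = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        best = std::max(best, std::fabs(a[i] - b[i]));
    }
    return best;
}
```

Between (0, 0) and (3, 4) this gives 7 for p = 1, 5 for p = 2, and a Chebyshev distance of 4.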

22. Numerical Utilities (numerical_utils.hpp)

Utility functions for numerical computations.

Constants:

  • epsilon — machine epsilon for double
  • default_rel_tol — default relative tolerance (1e-9)
  • default_abs_tol — default absolute tolerance (1e-12)

Functions:

| Function | Description |
|---|---|
| approx_equal | Approximate equality for floating-point numbers |
| is_zero | Check if value is approximately zero |
| is_finite | Check if value is finite |
| all_finite | Check if all values in range are finite |
| has_converged_abs | Absolute convergence check |
| has_converged_rel | Relative convergence check |
| has_converged | Combined convergence check |
| log1p_safe | Numerically stable log(1 + x) |
| expm1_safe | Numerically stable exp(x) - 1 |
| clamp | Clamp value to range |
| in_range | Check if value is in range |
| relative_error | Relative error between values |
| safe_divide | Safe division (avoids divide by zero) |
| kahan_sum | Kahan summation (2 overloads) |
| approx_equal_range | Approximate equality for ranges |
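
Two of these utilities correspond to well-known techniques, sketched below in standalone form. The tolerance rule used here, |a − b| ≤ max(abs_tol, rel_tol · max(|a|, |b|)), is one common choice and is an assumption about approx_equal, as are the function names; only the default tolerance values 1e-9 and 1e-12 come from the constants listed above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: combined relative/absolute tolerance comparison.
bool approx_equal_sketch(double a, double b,
                         double rel_tol = 1e-9, double abs_tol = 1e-12) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return diff <= std::max(abs_tol, rel_tol * scale);
}

// Hypothetical sketch of Kahan (compensated) summation.
double kahan_sum_sketch(const std::vector<double>& values) {
    double sum = 0.0;
    double compensation = 0.0;  // running estimate of the lost low-order bits
    for (double v : values) {
        const double y = v - compensation;
        const double t = sum + y;
        compensation = (t - sum) - y;
        sum = t;
    }
    return sum;
}

// Plain left-to-right summation, for comparison.
double naive_sum(const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) sum += v;
    return sum;
}
```

Summing 1.0 followed by ten copies of 1e-16 shows the difference: each small term is below half a unit in the last place of 1.0, so the naive loop returns exactly 1.0, while the compensated loop carries the lost bits forward and returns a value greater than 1.0. Aggressive floating-point optimization flags such as -ffast-math can remove the compensation step.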

23. Multivariate Analysis (multivariate.hpp)

Multivariate analysis functions.

Structs:

  • pca_result — fields: components, explained_variance, explained_variance_ratio

Functions:

| Function | Description |
|---|---|
| covariance_matrix | Covariance matrix |
| correlation_matrix | Correlation matrix |
| standardize | Standardize data (z-score) |
| min_max_scale | Min-max scaling |
| power_iteration | Power iteration for eigenvalue |
| pca | Principal component analysis |
| pca_transform | Transform data using PCA result |

24. Time Series (time_series.hpp)

Time series analysis functions.

Functions:

| Function | Description |
|---|---|
| autocorrelation | Autocorrelation at given lag |
| acf | Autocorrelation function (all lags) |
| pacf | Partial autocorrelation function |
| mae | Mean absolute error |
| mse | Mean squared error |
| rmse | Root mean squared error |
| mape | Mean absolute percentage error |
| moving_average | Simple moving average |
| exponential_moving_average | Exponential moving average |
| diff | First differences |
| seasonal_diff | Seasonal differences |
| lag | Lag operator |

25. Categorical Data Analysis (categorical.hpp)

Categorical data analysis.

Structs:

  • contingency_table_result — fields: table, row_totals, col_totals, total, n_rows, n_cols
  • odds_ratio_result — fields: odds_ratio, log_odds_ratio, se_log_odds_ratio, ci_lower, ci_upper
  • relative_risk_result — fields: relative_risk, log_relative_risk, se_log_relative_risk, ci_lower, ci_upper
  • risk_difference_result — fields: risk_difference, se, ci_lower, ci_upper

Functions:

| Function | Description |
|---|---|
| contingency_table | Create contingency table |
| odds_ratio | Odds ratio (table or 2x2 values) |
| relative_risk | Relative risk (table or 2x2 values) |
| risk_difference | Risk difference (table or 2x2 values) |
| number_needed_to_treat | Number needed to treat |
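
The fields of odds_ratio_result suggest an interval built on the log scale. The standalone sketch below shows the standard Wald (Woolf) construction for a 2x2 table with cells a, b in the first row and c, d in the second. The struct and function names are hypothetical, and whether statcpp uses this interval, this cell ordering, or a continuity correction for zero cells is not stated in this document.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type mirroring the field names listed above.
struct odds_ratio_sketch_result {
    double odds_ratio;
    double log_odds_ratio;
    double se_log_odds_ratio;
    double ci_lower;
    double ci_upper;
};

// Wald interval on the log-odds scale; z defaults to the 97.5% normal
// quantile, which gives a 95% interval.
odds_ratio_sketch_result odds_ratio_wald(double a, double b, double c, double d,
                                         double z = 1.959963984540054) {
    assert(a > 0.0 && b > 0.0 && c > 0.0 && d > 0.0);
    odds_ratio_sketch_result r{};
    r.odds_ratio = (a * d) / (b * c);
    r.log_odds_ratio = std::log(r.odds_ratio);
    r.se_log_odds_ratio = std::sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d);
    r.ci_lower = std::exp(r.log_odds_ratio - z * r.se_log_odds_ratio);
    r.ci_upper = std::exp(r.log_odds_ratio + z * r.se_log_odds_ratio);
    return r;
}
```

For the table (20, 80; 10, 90) the odds ratio is (20 · 90) / (80 · 10) = 2.25 and the standard error of its logarithm is √(25/144) = 5/12.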

26. Survival Analysis (survival.hpp)

Survival analysis functions.

Structs:

  • kaplan_meier_result — fields: times, survival, se, ci_lower, ci_upper, n_at_risk, n_events, n_censored
  • logrank_result — fields: statistic, p_value, df, expected1, expected2, observed1, observed2
  • hazard_rate_result — fields: times, hazard, cumulative_hazard

Functions:

| Function | Description |
|---|---|
| kaplan_meier | Kaplan-Meier survival estimate |
| logrank_test | Log-rank test |
| median_survival_time | Median survival time |
| nelson_aalen | Nelson-Aalen cumulative hazard estimate |

27. Robust Statistics (robust.hpp)

Robust statistical methods.

Structs:

  • outlier_detection_result — fields: outliers, outlier_indices, lower_fence, upper_fence, q1, q3, iqr_value

Functions:

| Function | Description |
|---|---|
| mad | Median absolute deviation |
| mad_scaled | Scaled MAD (consistent estimator) |
| detect_outliers_iqr | Outlier detection via IQR method |
| detect_outliers_zscore | Outlier detection via z-score |
| detect_outliers_modified_zscore | Outlier detection via modified z-score |
| winsorize | Winsorize data |
| cooks_distance | Cook's distance |
| dffits | DFFITS influence measure |
| hodges_lehmann | Hodges-Lehmann estimator |
| biweight_midvariance | Biweight midvariance |
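
To illustrate why the MAD is considered robust, the standalone sketch below computes the median absolute deviation and its scaled version. The scale constant 1.4826 is the usual value that makes the MAD a consistent estimator of the standard deviation for normal data; the function names and the exact constant statcpp uses are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a copy of the data (the caller's vector is left untouched).
double median_copy(std::vector<double> v) {
    assert(!v.empty());
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Hypothetical sketch: median absolute deviation from the median.
double mad_sketch(const std::vector<double>& v) {
    const double med = median_copy(v);
    std::vector<double> dev;
    dev.reserve(v.size());
    for (double x : v) dev.push_back(std::fabs(x - med));
    return median_copy(dev);
}

// Hypothetical sketch: MAD scaled for consistency with the standard
// deviation under normality.
double mad_scaled_sketch(const std::vector<double>& v) {
    return 1.4826 * mad_sketch(v);
}
```

For {1, 2, 3, 4, 100} the median is 3 and the absolute deviations are {2, 1, 0, 1, 97}, whose median is 1. The single extreme value 100 therefore has no effect on the MAD, whereas it dominates the standard deviation.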

28. Clustering (clustering.hpp)

Clustering algorithms.

Enums:

  • linkage_type — values: single, complete, average, ward

Structs:

  • kmeans_result — fields: labels, centroids, inertia, n_iter
  • dendrogram_node — fields: left, right, distance, count

Functions:

| Function | Description |
|---|---|
| euclidean_distance | Euclidean distance (vector) |
| manhattan_distance | Manhattan distance (vector) |
| kmeans_plusplus_init | K-means++ initialization |
| kmeans | K-means clustering |
| hierarchical_clustering | Hierarchical clustering |
| cut_dendrogram | Cut dendrogram at k clusters |
| silhouette_score | Silhouette score |

29. Data Wrangling (data_wrangling.hpp)

Data transformation and preprocessing.

Constants:

  • NA — NaN sentinel value

Structs:

  • group_result<K,V> — fields: groups
  • aggregation_result<K> — fields: keys, values
  • label_encoding_result<T> — fields: encoded, mapping, classes
  • validation_result — fields: is_valid, n_missing, n_infinite, n_negative, missing_indices, infinite_indices, negative_indices

Functions:

| Function | Description |
|---|---|
| is_na | Check if value is NA/NaN |
| dropna | Remove NaN values (2 overloads) |
| fillna | Fill NaN with constant |
| fillna_mean | Fill NaN with mean |
| fillna_median | Fill NaN with median |
| fillna_ffill | Forward fill NaN |
| fillna_bfill | Backward fill NaN |
| fillna_interpolate | Interpolate NaN values |
| filter | Filter elements by predicate |
| filter_rows | Filter rows of matrix |
| filter_range | Filter by value range |
| log_transform | Log transformation |
| log1p_transform | Log(1+x) transformation |
| sqrt_transform | Square root transformation |
| boxcox_transform | Box-Cox transformation |
| rank_transform | Rank transformation |
| group_by | Group by key function |
| group_mean | Mean by group |
| group_sum | Sum by group |
| group_count | Count by group |
| sort_values | Sort values |
| argsort | Indices that sort data |
| sample_with_replacement | Random sampling with replacement |
| sample_without_replacement | Random sampling without replacement |
| stratified_sample | Stratified random sampling |
| drop_duplicates | Remove duplicate values |
| value_counts | Count unique values |
| get_duplicates | Get duplicate values |
| rolling_mean | Rolling mean |
| rolling_std | Rolling standard deviation |
| rolling_min | Rolling minimum |
| rolling_max | Rolling maximum |
| rolling_sum | Rolling sum |
| label_encode | Label encoding |
| one_hot_encode | One-hot encoding |
| bin_equal_width | Equal-width binning |
| bin_equal_freq | Equal-frequency binning |
| validate_data | Data validation |
| validate_range | Range validation |

30. Missing Data (missing_data.hpp)

Advanced missing data handling.

Enums:

  • missing_mechanism — values: mcar, mar, mnar, unknown

Structs:

  • mcar_test_result — fields: chi_square, p_value, df, is_mcar, interpretation
  • missing_pattern_info — fields: patterns, pattern_counts, missing_rates, overall_missing_rate, n_complete_cases, n_patterns
  • multiple_imputation_result — fields: imputed_datasets, m, pooled_means, pooled_vars, within_vars, between_vars, fraction_missing_info
  • sensitivity_analysis_result — fields: delta_values, estimated_means, estimated_vars, original_mean, original_var, interpretation
  • tipping_point_result — fields: tipping_point, found, threshold, interpretation
  • complete_case_result — fields: complete_data, n_complete, n_dropped, proportion_complete

Functions:

| Function | Description |
|---|---|
| analyze_missing_patterns | Analyze missing data patterns |
| create_missing_indicator | Create missing indicator matrix |
| test_mcar_simple | Simple MCAR test |
| diagnose_missing_mechanism | Diagnose missing data mechanism |
| impute_conditional_mean | Conditional mean imputation |
| multiple_imputation_pmm | Multiple imputation (PMM) |
| multiple_imputation_bootstrap | Multiple imputation (bootstrap) |
| sensitivity_analysis_pattern_mixture | Pattern mixture sensitivity analysis |
| sensitivity_analysis_selection_model | Selection model sensitivity analysis |
| find_tipping_point | Tipping point analysis |
| extract_complete_cases | Extract complete cases |
| correlation_matrix_pairwise | Pairwise complete correlation matrix |

31. Umbrella Header (statcpp.hpp)

Convenience header that includes all modules. No additional functions defined.

#include "statcpp/statcpp.hpp"  // Includes everything

Common Design Principles

Random Access Iterator-Based Interface

Most functions accept STL-style random access iterator pairs (first, last).

std::vector<double> data = {1.0, 2.0, 3.0, 4.0, 5.0};
double avg = statcpp::mean(data.begin(), data.end());

Note: Matrix-based functions (e.g., GLM, ANOVA with design matrices, multiple regression, covariance matrix) use std::vector<std::vector<double>> instead of iterator pairs.

Projection Support

Many functions accept an optional projection (a callable applied to each element before it is used), so statistics can be computed directly on a struct member or other derived value.

struct Point { double x, y; };
std::vector<Point> points = {{1, 2}, {3, 4}, {5, 6}};

// Mean of x coordinates
double avg_x = statcpp::mean(points.begin(), points.end(),
                              [](const Point& p) { return p.x; });
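
How a projection overload might look on the implementation side can be sketched in a few lines. This is a standalone illustration of the pattern, not statcpp's source; the name mean_with_projection is hypothetical, and the sketch uses assert where the library reports errors by throwing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: a mean that applies a projection to each element
// before accumulating it.
template <typename Iterator, typename Projection>
double mean_with_projection(Iterator first, Iterator last, Projection proj) {
    assert(first != last && "empty range");
    double total = 0.0;
    std::size_t n = 0;
    for (Iterator it = first; it != last; ++it, ++n) {
        total += static_cast<double>(proj(*it));
    }
    return total / static_cast<double>(n);
}

struct Point { double x, y; };
```

Because the projection is an ordinary callable, the same function computes the mean of x or of y over a std::vector<Point> depending on the lambda passed, without copying the coordinates into a separate container first.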

Exception Handling

For invalid input (empty range, out-of-range parameters, etc.), std::invalid_argument is thrown.

std::vector<double> empty;
try {
    double avg = statcpp::mean(empty.begin(), empty.end());
} catch (const std::invalid_argument& e) {
    std::cerr << e.what() << std::endl;  // "statcpp::mean: empty range"
}

Usage Examples

Basic Statistical Analysis

#include "statcpp/basic_statistics.hpp"
#include "statcpp/dispersion_spread.hpp"
#include "statcpp/order_statistics.hpp"  // quartiles
#include <algorithm>
#include <vector>

std::vector<double> data = {5, 2, 8, 1, 3, 7, 4};

// Basic statistics
double avg = statcpp::mean(data.begin(), data.end());
double sd = statcpp::stddev(data.begin(), data.end());

// Order statistics (sorting required)
std::sort(data.begin(), data.end());
double med = statcpp::median(data.begin(), data.end());
auto q = statcpp::quartiles(data.begin(), data.end());

Hypothesis Testing

#include "statcpp/parametric_tests.hpp"
#include <iostream>
#include <vector>

std::vector<double> sample1 = {/* data */};
std::vector<double> sample2 = {/* data */};

// Two-sample t-test
auto result = statcpp::t_test_two_sample(
    sample1.begin(), sample1.end(),
    sample2.begin(), sample2.end()
);

std::cout << "t-statistic: " << result.statistic << std::endl;
std::cout << "p-value: " << result.p_value << std::endl;

Linear Regression

#include "statcpp/linear_regression.hpp"
#include <iostream>
#include <vector>

std::vector<double> x = {1, 2, 3, 4, 5};
std::vector<double> y = {2, 4, 5, 4, 5};

auto model = statcpp::simple_linear_regression(
    x.begin(), x.end(),
    y.begin()
);

std::cout << "Intercept: " << model.intercept << std::endl;
std::cout << "Slope: " << model.slope << std::endl;
std::cout << "R²: " << model.r_squared << std::endl;

Next Steps

  • For practical code examples, see Examples
  • For basic usage, see Usage Guide
  • For detailed function specifications, refer to the Doxygen-generated documentation