Dataset.jl and Corresponding Tests #3
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,16 @@ | ||
| name = "RidgeRegression" | ||
| uuid = "739161c8-60e1-4c49-8f89-ff30998444b1" | ||
| authors = ["Vivak Patel <vp314@users.noreply.github.com>"] | ||
| version = "0.1.0" | ||
| authors = ["Eton Tackett <etont@icloud.com>", "Vivak Patel <vp314@users.noreply.github.com>"] | ||
|
|
||
| [deps] | ||
| CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
| DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
| Downloads = "f43a241f-c20a-4ad4-852c-f6b1247861c6" | ||
| LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" | ||
|
|
||
| [compat] | ||
| CSV = "0.10.15" | ||
| DataFrames = "1.8.1" | ||
| Downloads = "1.7.0" | ||
| julia = "1.12.4" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -14,6 +14,7 @@ makedocs(; | |
| ), | ||
| pages=[ | ||
| "Home" => "index.md", | ||
| "Design" => "design.md", | ||
| ], | ||
| ) | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| # Motivation and Background | ||
| Many modern science problems involve regression with extremely large numbers of predictors. Genome-wide association studies (GWAS), for example, try to identify genetic variants associated with a disease phenotype using hundreds of thousands or millions of genomic features. In such settings, traditional least squares methods fail because of noise and ill-conditioning. Penalized Least Squares (PLS) extends ordinary least squares (OLS) regression by adding a penalty term to shrink parameter estimates. Ridge regression, an approach within PLS, adds an $\ell_2$ regularization term, producing a regularized estimator. | ||
|
|
||
| Mathematically, ridge regression estimates the regression coefficients by solving the penalized least squares problem | ||
| ```math | ||
| \hat{\boldsymbol{\beta}} | ||
| = | ||
| \arg\min_{\boldsymbol{\beta}} | ||
| \left( | ||
| \| \mathbf{y} - X\boldsymbol{\beta} \|^2 | ||
| + | ||
| \lambda \| \boldsymbol{\beta} \|^2 | ||
| \right) | ||
| ``` | ||
| where $\lambda > 0$ is a regularization parameter that controls the strength of the penalty. | ||
|
|
||
| The purpose of ridge regression is to stabilize regression estimates when the predictors are highly correlated or the design matrix $X$ is nearly singular. Equivalently, the ridge estimates minimize the sum of squared residuals subject to a constraint on the $\ell_2$ norm of the coefficient vector, $\|\boldsymbol{\beta}\|^2 \leq t$, which shrinks the least squares estimates toward the origin. This reduces the variance of the coefficient estimates and mitigates the effects of multicollinearity. | ||
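As a concrete illustration, the ridge solution has the closed form $(X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$; a minimal Julia sketch (the helper name `ridge_estimate` and the toy data are hypothetical, not part of the package):

```julia
using LinearAlgebra

# Closed-form ridge estimate: solves (XᵀX + λI)β = Xᵀy.
# Illustrative sketch only; not part of the RidgeRegression package.
function ridge_estimate(X::AbstractMatrix, y::AbstractVector, λ::Real)
    p = size(X, 2)
    return (X' * X + λ * I(p)) \ (X' * y)
end

X = [1.0 0.0; 0.0 1.0; 1.0 1.0]
y = [1.0, 2.0, 3.0]
β̂ = ridge_estimate(X, y, 0.1)
```

Larger values of $\lambda$ shrink $\hat{\boldsymbol{\beta}}$ toward the origin; at $\lambda = 0$ the formula reduces to ordinary least squares whenever $X^\top X$ is invertible.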
|
|
||
| There are many numerical algorithms available to compute ridge regression estimates including direct methods, Krylov subspace methods, gradient-based optimization, coordinate descent, and stochastic gradient descent. These algorithms differ in their computational costs and numerical stability. | ||
|
|
||
| The goal of this experiment is to investigate the performance of these algorithms when we vary the structure and scale of the regression problem. To do this, we consider the linear model $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where the matrix ${X}$ may be constructed with varying dimensions, sparsity patterns, and conditioning properties. | ||
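One simple way to construct design matrices with controlled conditioning is to fix the singular values in an SVD factorization; a sketch under that assumption (the function name `make_design` is hypothetical):

```julia
using LinearAlgebra

# Generate an n×p design matrix with prescribed condition number κ
# by assembling X = U Σ Vᵀ with chosen singular values.
function make_design(n::Int, p::Int, κ::Real)
    U = Matrix(qr(randn(n, p)).Q)          # n×p, orthonormal columns
    V = Matrix(qr(randn(p, p)).Q)          # p×p, orthogonal
    σ = collect(range(κ, 1.0; length=p))   # singular values from κ down to 1
    return U * Diagonal(σ) * V'
end

X = make_design(50, 5, 1.0e3)
```

Sparsity could be introduced similarly by zeroing entries of $X$, although doing so perturbs the spectrum.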
| # Questions | ||
| The primary goal of this experiment is to compare numerical algorithms for computing ridge regression estimates under various conditions. In particular, we aim to address the following questions: | ||
|
|
||
| 1. How does the performance of ridge regression algorithms change as the structural and numerical properties of the regression problem vary? | ||
|
|
||
| 2. Which ridge regression algorithm provides the best balance between numerical stability and computational cost across these problem regimes? | ||
|
|
||
| # Experimental Units | ||
| The experimental units are the datasets under fixed penalty weights. All treatments will be applied to each experimental unit, so that differences in performance can be attributed to the algorithms themselves rather than the data. Each dataset will contain a matrix ${X}$, a response vector $\mathbf{y}$, and a fixed regularization parameter ${\lambda}$. | ||
|
|
||
| Blocks are defined by combinations of the experimental blocking factors, including dimensional regime, matrix sparsity, and ridge penalty magnitude. Each block represents datasets with similar structural properties. Within each block, multiple datasets will be generated, and each dataset forms an experimental unit. For every experimental unit all treatments are applied. | ||
|
|
||
| Datasets will be grouped according to their dimensional regime, characterized as $p \ll n$, $p \approx n$, and $p \gg n$. These regimes correspond to fundamentally different geometric properties of the design matrix, including rank behavior, conditioning, and the stability of the normal equations. | ||
|
|
||
| In addition to the dimensional blocking factor, the strength of the ridge penalty will be incorporated as a secondary blocking factor. The ridge estimator is $\hat{\boldsymbol{\beta}}_R = (X^\top X + \lambda I)^{-1}X^\top \mathbf{y}$. The condition number of a matrix is defined as $\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$. In the context of ridge regression, the regularization parameter ${\lambda}$ impacts the condition number. Let $X = U\Sigma V^\top$ be the SVD of $X$, with singular values $\sigma_1,\dots,\sigma_p$. | ||
|
|
||
| Then | ||
| ```math | ||
| X^\top X = V \Sigma^\top \Sigma V^\top | ||
| = V \,\mathrm{diag}(\sigma_1^2,\dots,\sigma_p^2)\, V^\top . | ||
| ``` | ||
|
|
||
| Adding the ridge term gives | ||
|
|
||
| ```math | ||
| X^\top X + \lambda I | ||
| = | ||
| V \,\mathrm{diag}(\sigma_1^2+\lambda,\dots,\sigma_p^2+\lambda)\, V^\top . | ||
| ``` | ||
|
|
||
| ```math | ||
| \kappa_2(X^\top X+\lambda I) | ||
| = | ||
| \frac{\sigma_{\max}^2+\lambda}{\sigma_{\min}^2+\lambda}. | ||
| ``` | ||
|
|
||
| Because the performance of numerical algorithms is strongly influenced by the conditioning of the system they solve, the ridge penalty effectively creates regression problems with different numerical difficulty. This provides a way to assess how algorithm performance, convergence behavior, and computational cost depend on the numerical stability of the problem. In this experiment, the magnitude of $\lambda$ is selected relative to the smallest and largest singular values of $X$. A weak regularization regime corresponds to $\lambda \approx \sigma_{\min}^2$, where the ridge penalty begins to influence the smallest singular directions but the system remains moderately ill-conditioned. A moderate regularization regime corresponds to $\lambda \approx \sigma_{\min}\sigma_{\max}$, which substantially improves the conditioning of the problem by increasing the smallest eigenvalues of $X^\top X + \lambda I$. Finally, a strong regularization regime corresponds to $\lambda \approx \sigma_{\max}^2$, where the ridge penalty dominates the spectral scale of the problem and produces a well-conditioned system. | ||
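The identity above, and the three regularization regimes, can be checked numerically; a sketch assuming a small random dense problem:

```julia
using LinearAlgebra

# Check κ₂(XᵀX + λI) = (σ_max² + λ)/(σ_min² + λ) in the three λ regimes.
X = randn(100, 5)
σmin, σmax = extrema(svdvals(X))

for λ in (σmin^2, σmin * σmax, σmax^2)     # weak, moderate, strong
    κ = cond(X' * X + λ * I(5))
    @assert isapprox(κ, (σmax^2 + λ) / (σmin^2 + λ); rtol=1e-6)
end
```

In the strong regime the resulting condition number is at most $2$, since $(2\sigma_{\max}^2)/(\sigma_{\min}^2 + \sigma_{\max}^2) \le 2$.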
|
|
||
| Another blocking factor is whether the matrix $X$ is sparse or dense. Many algorithms behave differently depending on sparsity. Ridge regression involves many operations on $X$, including matrix-matrix and matrix-vector products. A dense matrix leads to high computational cost, whereas a sparse matrix can significantly reduce it. As such, different algorithms may perform better depending on the sparsity structure of $X$, making matrix sparsity a relevant blocking factor when comparing algorithm behavior and computational efficiency. | ||
|
|
||
| The total number of block combinations is determined by the product of the number of levels in each blocking factor, denoted $b$. For example, if the experiment includes three dimensional regimes, two sparsity levels, and two regularization strengths, then there are $3 \times 2 \times 2 = 12$ block combinations. We denote by $r$ the number of replicated datasets within each block. The total number of experimental units is then $b \cdot r$. | ||
|
|
||
| | Blocking System | Factor | Blocks | | ||
| |:----------------|:-------|:-------| | ||
| | Dataset | Dimensional regime | $(p \ll n)$, $(p \approx n)$, $(p \gg n)$| | ||
| | Ridge Penalty | Magnitude of ${\lambda}$ relative to the spectral scale of $X^\top X$ | Weak ($\lambda \approx \sigma_{\min}^2$), Moderate ($\lambda \approx \sigma_{\min}\sigma_{\max}$), Strong ($\lambda \approx \sigma_{\max}^2$), where $\sigma_{\min}$ and $\sigma_{\max}$ denote the smallest and largest singular values of $X$. | | ||
| | Matrix Sparsity| Density of non-zero values in $X$ | Sparse (< 10% non-zero), Moderate (10%-50% non-zero), Dense (> 50% non-zero)| | ||
| # Treatments | ||
|
|
||
| The treatments are the ridge regression solution methods: | ||
|
|
||
| - Gradient-based optimization | ||
| - Stochastic gradient descent | ||
| - Direct Methods | ||
| - Golub Kahan Bidiagonalization | ||
|
|
||
| Since each experimental unit receives all $t$ treatments, the total number of algorithm runs in the experiment is $t \cdot b \cdot r$. For this experiment, $t = 4$. To ensure fair comparison between algorithms, each treatment will be applied under a fixed time constraint: each algorithm will be run for a maximum of two hours per experimental unit. | ||
| # Observational Units and Measurements | ||
|
|
||
| The observational units are the algorithm–dataset pairs. For each combination we will observe the following: | ||
|
|
||
| | Column Name | Data Type | Description | | ||
| |:---|:---|:---| | ||
| | `dataset_id` | Positive Integer | Identifier for the generated dataset (experimental unit). | | ||
| | `dimensional_regime` | String | Relationship between predictors and observations: `p << n`, `p ≈ n`, or `p >> n`. | | ||
| | `sparsity_level` | String | Density of the matrix `X`: `Sparse`, `Moderate`, or `Dense`. | | ||
| | `lambda_level` | String | Relative magnitude of the ridge penalty parameter `λ`: `Weak`, `Moderate`, or `Strong`. | | ||
| | `algorithm` | String | Ridge regression solution method used: `GradientDescent`, `SGD`, or `DirectMethod`. | | ||
| | `runtime_seconds` | Positive Floating-point | Time required for the algorithm to compute a solution. | | ||
| | `iterations` | Positive Integer | Number of iterations performed by the algorithm (`NA` for direct methods). | | ||
|
|
||
|
|
||
| The collected measurements will be written to a CSV file. Each row in the file corresponds to a single algorithm–dataset pair, which forms the observational unit of the experiment. The columns represent the recorded measurements. After the experiment, the resulting CSV file should contain one row per algorithm–dataset pair (algorithms × datasets), and each row will contain exactly seven columns. |
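A sketch of recording rows with this schema (the values are fabricated for illustration; column names follow the table above):

```julia
using CSV, DataFrames

# One row per algorithm–dataset pair; seven columns matching the schema.
results = DataFrame(
    dataset_id = Int[],
    dimensional_regime = String[],
    sparsity_level = String[],
    lambda_level = String[],
    algorithm = String[],
    runtime_seconds = Float64[],
    iterations = Union{Int, Missing}[],
)

push!(results, (1, "p << n", "Dense", "Weak", "GradientDescent", 12.3, 450))
push!(results, (1, "p << n", "Dense", "Weak", "DirectMethod", 0.8, missing))  # NA iterations

path = tempname() * ".csv"
CSV.write(path, results)
```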
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,12 @@ | ||
| module RidgeRegression | ||
|
|
||
| # Write your package code here. | ||
| using CSV | ||
| using DataFrames | ||
| using Downloads | ||
| using LinearAlgebra | ||
|
|
||
| include("dataset.jl") | ||
|
|
||
| export Dataset, load_csv_dataset, one_hot_encode | ||
|
|
||
| end |
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,144 @@ | ||||||
| """ | ||||||
| Dataset <: ExperimentalUnit | ||||||
|
|
||||||
| A dataset for Ridge Regression experiments. | ||||||
|
|
||||||
|
|
||||||
| # Description | ||||||
|
|
||||||
| A `Dataset` object stores the design matrix ``X`` and response vector ``y`` | ||||||
| for a regression problem. These datasets serve as the experimental units for ridge regression experiments, allowing us to evaluate the performance of ridge regression models on various datasets. | ||||||
|
Owner review comment (on lines +4 to +9): You should consolidate this into a few sentences that describe the nature of what is going on as concisely as possible.
||||||
|
|
||||||
| # Fields | ||||||
| - `name::String`: Name of dataset | ||||||
| - `X::TX`: Matrix of variables/features | ||||||
| - `y::TY`: Target vector | ||||||
|
|
||||||
| # Constructor | ||||||
|
|
||||||
| Dataset(name::String, X::AbstractMatrix, y::AbstractVector) | ||||||
|
|
||||||
| ## Arguments | ||||||
| - `name::String`: Name of dataset | ||||||
| - `X::TX`: Matrix of variables/features | ||||||
| - `y::TY`: Target vector | ||||||
|
|
||||||
| ## Returns | ||||||
| - A `Dataset` object containing the numeric design matrix and response vector. | ||||||
|
|
||||||
| ## Throws | ||||||
| - `ArgumentError`: If the number of rows in `X` does not equal the length of `y`. | ||||||
|
Owner review comment (on lines +1 to +29): There should be documentation for the struct being created and then documentation for the constructor in the same docstring.
||||||
|
|
||||||
| !!! note | ||||||
| `Dataset` objects are used as experimental units when evaluating | ||||||
| ridge regression algorithms. The parametric design allows both dense | ||||||
| and sparse matrices to be stored without forcing conversion to a | ||||||
| dense `Matrix{Float64}`. | ||||||
| """ | ||||||
| struct Dataset{TX<:AbstractMatrix, TY<:AbstractVector} | ||||||
| name::String | ||||||
| X::TX | ||||||
| y::TY | ||||||
|
|
||||||
| function Dataset(name::String, X::TX, y::TY) where {TX<:AbstractMatrix, TY<:AbstractVector} | ||||||
| size(X, 1) == length(y) || | ||||||
| throw(ArgumentError("X and y must have same number of rows")) | ||||||
|
|
||||||
| new{TX, TY}(name, X, y) | ||||||
| end | ||||||
| end | ||||||
|
|
||||||
| """ | ||||||
| one_hot_encode(Xdf::DataFrame; drop_first=true) | ||||||
|
|
||||||
| One-hot encode categorical (string-like) features in `Xdf`. | ||||||
|
|
||||||
| # Arguments | ||||||
| - `Xdf::DataFrame`: Input DataFrame containing the feature columns. | ||||||
|
|
||||||
| # Keyword Arguments | ||||||
| - `cols_to_encode`: A collection of column names or indices to one-hot encode. | ||||||
| - `drop_first::Bool=true`: If `true`, drop the first dummy column for | ||||||
| each categorical feature to avoid multicollinearity. | ||||||
|
Owner review comment: The indenting here is not consistent with the indent level on line 51. Please check the BlueStyle guide to understand how much indentation is needed.
||||||
|
|
||||||
| # Returns | ||||||
| - `Matrix{Float64}`: A numeric matrix containing the encoded feature. | ||||||
|
|
||||||
| """ | ||||||
| function one_hot_encode(Xdf::DataFrame; cols_to_encode, drop_first::Bool = true)::Matrix{Float64} | ||||||
| n = nrow(Xdf) | ||||||
| cols = Vector{Vector{Float64}}() | ||||||
| encode_names = Set(c isa Int ? Symbol(names(Xdf)[c]) : Symbol(c) for c in cols_to_encode) | ||||||
|
|
||||||
|
|
||||||
| for name in names(Xdf) #Selecting columns that aren't the target variable and pushing them to the columns. | ||||||
| col = Xdf[!, name] | ||||||
|
Owner review comment: Maybe move this inside the first if statement on line 75.
||||||
| name_sym = Symbol(name) | ||||||
| if name_sym in encode_names | ||||||
| scol = string.(col) # Convert to string for categorical processing. | ||||||
| lv = unique(scol) #Get unique category levels. | ||||||
| ind = scol .== permutedims(lv) #Create indicator matrix for each level of the categorical variable. | ||||||
| #Permutedims is used to align the dimensions for broadcasting. | ||||||
| #Broadcasting compares each element of `scol` with each level in `lv`, resulting in a matrix where each column corresponds to a level and contains `true` for rows that match that level and `false` otherwise. | ||||||
|
|
||||||
| if drop_first && size(ind, 2) > 1 #Drop the first column of the indicator matrix to avoid multicollinearity if drop_first is true and there are multiple levels. | ||||||
| ind = ind[:, 2:end] | ||||||
| end | ||||||
|
|
||||||
| for j in 1:size(ind, 2) | ||||||
| push!(cols, Float64.(ind[:, j])) #Convert the boolean indicator columns to Float64 and add them to the list of columns. | ||||||
| end | ||||||
| else | ||||||
| eltype(col) <: Real || | ||||||
| throw(ArgumentError("Column $name must be numeric unless it is listed in cols_to_encode")) | ||||||
|
|
||||||
| push!(cols, Float64.(col)) | ||||||
| end | ||||||
| end | ||||||
|
|
||||||
| p = length(cols) | ||||||
| X = Matrix{Float64}(undef, n, p) | ||||||
| for j in 1:p | ||||||
| X[:, j] = cols[j] | ||||||
| end | ||||||
|
|
||||||
| return Matrix{Float64}(X) | ||||||
|
Owner review comment: You should have an intercept column (column of 1s) prepended to X. I would do this higher up, probably around line 68.
||||||
|
|
||||||
| end | ||||||
| """ | ||||||
| load_csv_dataset(path_or_url; target_col, name="csv_dataset") | ||||||
|
Owner review comment: Signatures should include types, as you have done previously.
|
||||||
|
|
||||||
| Load a dataset from a CSV file or URL. | ||||||
|
|
||||||
| # Arguments | ||||||
| - `path_or_url::String`: Local file path or web URL containing CSV data. | ||||||
|
|
||||||
| # Keyword Arguments | ||||||
| - `cols_to_encode=Symbol[]`: Column names or indices in the feature data to one-hot encode. | ||||||
| - `target_col`: Column index or column name containing the response variable. | ||||||
| - `name::String="csv_dataset"`: Dataset name. | ||||||
|
|
||||||
| # Returns | ||||||
| - `Dataset`: A dataset containing the encoded feature matrix `X`, response vector `y`, and dataset name. | ||||||
| """ | ||||||
| function load_csv_dataset(path_or_url::String; cols_to_encode=Symbol[], target_col, name::String = "csv_dataset") | ||||||
|
|
||||||
|
|
||||||
| filepath = | ||||||
| startswith(path_or_url, "http") ? | ||||||
| Downloads.download(path_or_url) : | ||||||
| path_or_url | ||||||
|
|
||||||
| df = DataFrame(CSV.File(filepath)) #Read CSV file into a DataFrame. | ||||||
| df = dropmissing(df) #Remove rows with missing values. | ||||||
| Xdf = select(df, DataFrames.Not(target_col)) #Select all columns except the target column for features. | ||||||
|
|
||||||
| y = target_col isa Int ? | ||||||
| df[:, target_col] : #If target_col is an integer, use it as a column index to extract the target variable from the DataFrame. | ||||||
| df[:, Symbol(target_col)] #Extract the target variable based on whether target_col is an index or a name. | ||||||
|
|
||||||
|
|
||||||
| feature_names = names(Xdf) | ||||||
| encode_cols = [c isa Int ? Symbol(names(Xdf)[c]) : Symbol(c) for c in cols_to_encode] | ||||||
| X = one_hot_encode(Xdf; cols_to_encode=encode_cols, drop_first = true) | ||||||
|
|
||||||
|
|
||||||
|
|
||||||
| return Dataset(name, X, collect(Float64, y)) | ||||||
| end | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,2 +1,9 @@ | ||
| [deps] | ||
| CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b" | ||
| DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0" | ||
| Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40" | ||
| LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" | ||
|
|
||
| [compat] | ||
| CSV = "0.10" | ||
| DataFrames = "1" |
|
Owner review comment: Individual test files should be wrapped as their own modules.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| @testset "Dataset constructor stores fields correctly" begin | ||
| X = [1 2; 3 4] | ||
| y = [10, 20] | ||
| d = Dataset("toy", X, y) | ||
|
|
||
| @test "toy" == d.name | ||
| @test X == d.X | ||
| @test y == d.y | ||
| @test (2, 2) == size(d.X) | ||
| @test 2 == length(d.y) | ||
| @test 1.0 == d.X[1, 1] | ||
| @test 20.0 == d.y[2] | ||
| end | ||
|
|
||
| @testset "Dataset constructor throws error for mismatched dimensions" begin | ||
| X = [1 2; 3 4] | ||
|
|
||
| @test_throws ArgumentError Dataset("bad", X, [1, 2, 3]) | ||
| end |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| @testset "one_hot_encode encodes specified categorical columns and keeps numeric columns" begin | ||
| df = DataFrame( | ||
| A = ["red", "blue", "red", "green"], | ||
| B = [1, 2, 3, 4], | ||
| C = ["small", "large", "medium", "small"] | ||
| ) | ||
|
|
||
| X = one_hot_encode(df; cols_to_encode=[:A, :C], drop_first=true) | ||
|
|
||
| @test (4, 5) == size(X) | ||
| @test [1.0, 2.0, 3.0, 4.0] == X[:, 3] | ||
| @test all(x -> x == 0.0 || x == 1.0, X[:, [1, 2, 4, 5]]) | ||
| @test all(vec(sum(X[:, 1:2]; dims=2)) .<= 1) | ||
| @test all(vec(sum(X[:, 4:5]; dims=2)) .<= 1) | ||
| end | ||
|
|
||
| @testset "one_hot_encode throws error for invalid column specifications" begin | ||
| df = DataFrame( | ||
| A = ["red", "blue", "red", "green"], | ||
| B = [1, 2, 3, 4], | ||
| C = ["small", "large", "medium", "small"] | ||
| ) | ||
|
|
||
| @test_throws ArgumentError one_hot_encode(df; cols_to_encode=[:A], drop_first=true) | ||
| end | ||
|
|
||
| @testset "one_hot_encode supports integer-coded categorical columns when specified" begin | ||
| df = DataFrame( | ||
| group = [1, 2, 1, 3], | ||
| x = [10.0, 20.0, 30.0, 40.0] | ||
| ) | ||
|
|
||
| X = one_hot_encode(df; cols_to_encode=[:group], drop_first=true) | ||
|
|
||
| @test (4, 3) == size(X) | ||
| @test [10.0, 20.0, 30.0, 40.0] == X[:, 3] | ||
| @test all(x -> x == 0.0 || x == 1.0, X[:, 1:2]) | ||
| end |
|
Owner review comment: Where do you test for missing values?
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,38 @@ | ||
| @testset "load_csv_dataset drops missing rows and uses target column" begin | ||
| tmp = tempname() * ".csv" | ||
|
|
||
| df = DataFrame( | ||
| a = [1.0, 2.0, missing, 4.0], | ||
| b = ["x", "y", "y", "x"], | ||
| y = [10.0, 20.0, 30.0, 40.0] | ||
| ) | ||
|
|
||
| CSV.write(tmp, df) | ||
|
|
||
| d = load_csv_dataset(tmp; target_col=:y, cols_to_encode=[:b], name="tmp") | ||
|
|
||
| @test "tmp" == d.name | ||
| @test 3 == length(d.y) | ||
| @test 3 == size(d.X, 1) | ||
| @test [10.0, 20.0, 40.0] == d.y | ||
| @test (3, 2) == size(d.X) | ||
| end | ||
|
|
||
| @testset "load_csv_dataset drops missing rows and uses target column by index" begin | ||
| tmp = tempname() * ".csv" | ||
|
|
||
| df = DataFrame( | ||
| a = [1.0, 2.0, missing, 4.0], | ||
| b = ["x", "y", "y", "x"], | ||
| y = [10.0, 20.0, 30.0, 40.0] | ||
| ) | ||
|
|
||
| CSV.write(tmp, df) | ||
|
|
||
| d = load_csv_dataset(tmp; target_col=3, cols_to_encode=[:b], name="tmp2") | ||
|
|
||
| @test "tmp2" == d.name | ||
| @test [10.0, 20.0, 40.0] == d.y | ||
| @test 3 == size(d.X, 1) | ||
| @test (3, 2) == size(d.X) | ||
| end |
Owner review comment: All dependencies should appear in the Project.toml file. You should activate the package environment and then "add ..." your dependencies to ensure compatibility and the correct environment for the package.