Skip to content

Array methylation normalization & annotation#438

Open
zitoa wants to merge 72 commits intodevelfrom
enable_450K_meth
Open

Array methylation normalization & annotation#438
zitoa wants to merge 72 commits intodevelfrom
enable_450K_meth

Conversation

@zitoa
Copy link
Copy Markdown
Contributor

@zitoa zitoa commented Feb 11, 2026

Array methylation normalization functions. Playbase.

@zitoa zitoa self-assigned this Feb 11, 2026
@zitoa zitoa added the enhancement New feature or request label Feb 11, 2026
@zitoa zitoa changed the title Array methylation normalization Array methylation normalization & annotation Feb 12, 2026
Copy link
Copy Markdown
Contributor

@ESCRI11 ESCRI11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions work as expected. Some comments anyways (mainly regarding mergeCpG function).

  1. rm and gc calls are pointless. e.g. on line 322 you remove LL to allocate an empty vector to it on next line, the other rm is before the function ends, hence variable will get out of scope inmediately and cleaned up by R automatically. R does gc when in need, most of the times calling it manually adds unneeded overheads.
  2. There are two loops where you are adding new elements on each iteration LL[[ff[i]]]; this is the most inefficient way to do things on R. The first loop could easily be vectorized (grouped mean via rowmeans), the second given the operations of "collapse unique" with strings maybe it's slightly more difficult (and probably less gains). Simple vectorized example run on my laptop shows 100x performance increase on first loop. Would be good to explore this options for better code.

@zitoa
Copy link
Copy Markdown
Contributor Author

zitoa commented Mar 30, 2026

@ESCRI11

  1. I've removed excessive use of rm() and gc().
  2. For metadata, I will then leave the R loop and use of LL[[ff[i]]]. For the other loop, i use colMeans. How would you do with rowmean? Maybe you meant rowsum? But NA (should be absent but who knows) are not handled by rowsum

@ESCRI11
Copy link
Copy Markdown
Contributor

ESCRI11 commented Apr 1, 2026

@zitoa Quick idea of how first loop could be vectorized:

## small example data
X <- matrix(c(
  0.1, 0.2, 0.3,
  0.4, NA,  0.6,
  0.7, 0.8, 0.9,
  NA,  0.5, 0.4,
  0.3, 0.3, 0.3,
  0.9, 0.9, NA
), nrow = 6, byrow = TRUE, dimnames = list(
  paste0("cg", 1:6),
  paste0("S", 1:3)
))
gene_sym <- c("A", "A", "B", "B", "B", "C")

X
#>      S1  S2  S3
#> cg1 0.1 0.2 0.3
#> cg2 0.4  NA 0.6
#> cg3 0.7 0.8 0.9
#> cg4  NA 0.5 0.4
#> cg5 0.3 0.3 0.3
#> cg6 0.9 0.9  NA
gene_sym
#> [1] "A" "A" "B" "B" "B" "C"

## current approach
collapse_loop <- function(X, gene_sym) {
  ff <- unique(gene_sym)
  LL <- list()
  for (i in seq_along(ff)) {
    jj <- which(gene_sym == ff[i])
    LL[[ff[i]]] <- t(as.matrix(colMeans(X[jj, , drop = FALSE], na.rm = TRUE)))
  }
  res <- do.call(rbind, LL)
  rownames(res) <- ff
  res
}

## vectorized example
collapse_vectorized <- function(X, gene_sym) {
  X_nona <- X
  not_na <- !is.na(X)
  X_nona[!not_na] <- 0

  sums   <- rowsum(X_nona, gene_sym, reorder = FALSE)
  counts <- rowsum(not_na * 1, gene_sym, reorder = FALSE)
  counts[counts == 0] <- NA  # 0/0 -> NA, same as colMeans(na.rm=TRUE)

  sums / counts
}

## compare
res_loop <- collapse_loop(X, gene_sym)
res_vec  <- collapse_vectorized(X, gene_sym)

res_loop
#>     S1        S2        S3
#> A 0.25 0.2000000 0.4500000
#> B 0.50 0.5333333 0.5333333
#> C 0.90 0.9000000       NaN
res_vec
#>     S1        S2        S3
#> A 0.25 0.2000000 0.4500000
#> B 0.50 0.5333333 0.5333333
#> C 0.90 0.9000000        NA
all.equal(res_loop, res_vec, tolerance = 1e-12)
#> [1] TRUE

Created on 2026-04-01 with reprex v2.1.1

@ESCRI11
Copy link
Copy Markdown
Contributor

ESCRI11 commented Apr 1, 2026

also, i dont understand the latest commits where you handle EPIC array and 850K array naming -- on OPG you just set EPIC array as an option, nowhere there is 850K array name

@zitoa
Copy link
Copy Markdown
Contributor Author

zitoa commented Apr 1, 2026

@ESCRI11 I set EPIC array & 850K array when i was testing playbase by script. it is in my list to stick only to EPIC. Will push.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants