Commit 174764b committed
Add results with updated library (fragments with zero intensity removed).
1 parent 619338e

33 files changed
Lines changed: 38683 additions & 819 deletions

README.md

Lines changed: 8 additions & 6 deletions
@@ -4,12 +4,14 @@ PeakDecoder is a machine learning-based metabolite identification algorithm for
 
 ### Workflow
 
-* Step-1: data is processed in untargeted mode (using MS-DIAL) to extract all precursor ion features (MS1) and their respective deconvoluted fragment ions (pseudo MS2) based on co-elution and co-mobility.
-* Step-2: a preliminary training set is generated by using the detected and deconvoluted peak-groups as targets and producing their corresponding decoys.
-* Step-3: targeted data extraction is performed (using Skyline) to extract the precursor and fragment ion signals for the training set from all the LC-IM-MS runs and export their XIC metrics.
-* Step-4: an SVM classifier is trained using multiple scores calculated from the XIC metrics of the training set. Before training, filtering for high-quality fragments is applied to keep high-quality peak-groups as targets (i.e., based on various thresholds for metrics of the precursor and at least 3 fragments: S/N, mass error, RT difference to precursor, and FWHM difference to precursor; details in Methods) and their corresponding decoys in the final training set. The model learns to distinguish true and false co-elution and co-mobility, independently of the features’ metabolite identity.
-* Step-5: TDX is performed to extract the signals of the query set of metabolites in the library from all the LC-IM-MS runs and export their XIC metrics.
-* Step-6: the trained model is used to determine the PeakDecoder score of the query set of metabolites and estimate an FDR.
+* <b>Step-1, Feature finding and fragment ion deconvolution:</b> data is processed in untargeted mode (using MS-DIAL) to extract all precursor ion features (MS1) and their respective deconvoluted fragment ions (pseudo MS2) based on co-elution and co-mobility.
+The alignment (Peak ID matrix, msp format) and all peak lists (txt, centroid) should be exported from MS-DIAL.
+* <b>Step-2, Target and decoy generation:</b> a preliminary training set is generated by using the detected and deconvoluted peak-groups as targets and producing their corresponding decoys.
+* <b>Step-3, Targeted data extraction for training:</b> targeted data extraction is performed (using Skyline) to extract the precursor and fragment ion signals for the training set from all the LC-IM-MS runs and export their XIC metrics.
+The Skyline report should include the required XIC metrics: area, height, mass error, FWHM (LC), RT, expected RT, and expected CCS.
+* <b>Step-4, Machine learning training:</b> an SVM classifier is trained using multiple scores calculated from the XIC metrics of the training set. Before training, filtering for high-quality fragments is applied to keep high-quality peak-groups as targets (i.e., based on various thresholds for metrics of the precursor and at least 3 fragments: S/N, mass error, RT difference to precursor, and FWHM difference to precursor) and their corresponding decoys in the final training set. The model learns to distinguish true and false co-elution and co-mobility, independently of the features’ metabolite identity.
+* <b>Step-5, Targeted data extraction for inference:</b> TDX is performed to extract the signals of the query set of metabolites in the library from all the LC-IM-MS runs and export their XIC metrics.
+* <b>Step-6, Machine learning inference:</b> the trained model is used to determine the PeakDecoder score of the query set of metabolites and estimate a false discovery rate (FDR). Results can be filtered using the PeakDecoder score corresponding to the estimated FDR threshold, taken from a table of (FDR, PeakDecoder score) pairs automatically generated after training (file PeakDecoder-FDR-thresholds_[dataset].csv).
 
 
 ### Data
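Step-6 filters results with a score cutoff looked up from the generated (FDR, PeakDecoder score) table. A minimal sketch of that lookup, in Python for illustration only (the pipeline itself is in R; `score_cutoff` is a hypothetical helper, and the example pairs reuse the 1% FDR = 0.96 and 3% FDR = 0.9 values that appear in the scripts' comments):

```python
# Illustrative sketch, NOT the repository's code: choose the PeakDecoder
# score cutoff for a desired FDR, then keep results at or above it.

def score_cutoff(fdr_score_pairs, target_fdr):
    """Return the lowest score whose estimated FDR is <= target_fdr."""
    eligible = [score for fdr, score in fdr_score_pairs if fdr <= target_fdr]
    if not eligible:
        raise ValueError("no score achieves the requested FDR")
    return min(eligible)

# Example (FDR, score) pairs; real values come from
# PeakDecoder-FDR-thresholds_[dataset].csv after training.
pairs = [(0.05, 0.70), (0.03, 0.90), (0.01, 0.96)]
cutoff = score_cutoff(pairs, target_fdr=0.01)

# Hypothetical scored query results filtered at the cutoff:
results = [("mol-A", 0.97), ("mol-B", 0.80)]
passing = [name for name, s in results if s >= cutoff]
```

Lower FDR targets map to stricter (higher) score cutoffs, so tightening the FDR shrinks the passing set.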

ScoringInference.R

Lines changed: 20 additions & 9 deletions
@@ -25,16 +25,18 @@ cutoff.PeakDecoderScore = 0.8 #
 #inputFileSamples = "Skyline-Report_Pput_IMS_MS1-and-MSMS.csv"
 #cutoff.PeakDecoderScore = 0.96 # 1% FDR
 
-#setwd(file.path(basePathData,"data/Rhodo"))
-#inputFileSamples = "Skyline-Report_Rhodo_IMS_MS1-and-MSMS.csv"
-#cutoff.PeakDecoderScore = 0.9 # 3% FDR
+# setwd(file.path(basePathData,"data/Rhodo"))
+# inputFileSamples = "Skyline-Report_Rhodo_IMS_MS1-and-MSMS.csv"
+# #inputFileSamples = "Skyline-Report_Rhodo-trypD5_IMS_MS1-and-MSMS.csv"
+# #cutoff.PeakDecoderScore = 0.9 # 3% FDR
+# cutoff.PeakDecoderScore = 0.96 # 1% FDR
 
 sampleString = basename(getwd())
 
 # Thresholds to consider molecules as identified in at least one sample:
 cutoff.Mz.Error = 18 # precursor mass tolerance, ppm
 cutoff.RT.Error = 0.3 # minutes
-cutoff.CCS.Error = 1 # percent
+cutoff.CCS.Error = 0.8 # percent
 cutoff.SignalToNoise = 2
 
 
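The cutoffs in the hunk above are relative errors (ppm for mass, percent for CCS). A sketch of how such errors are typically computed and compared against these thresholds, in Python for illustration only (not the repository's R code; the function names are hypothetical):

```python
# Illustrative sketch, NOT the repository's code: relative-error checks
# behind cutoff.Mz.Error (ppm) and cutoff.CCS.Error (percent).

CUTOFF_MZ_ERROR_PPM = 18    # precursor mass tolerance, ppm (from the script)
CUTOFF_CCS_ERROR_PCT = 0.8  # percent (the value updated in this commit)

def ppm_error(observed_mz, expected_mz):
    """Mass error in parts per million, relative to the expected m/z."""
    return (observed_mz - expected_mz) / expected_mz * 1e6

def pct_error(observed_ccs, expected_ccs):
    """CCS error as a percentage of the expected CCS."""
    return (observed_ccs - expected_ccs) / expected_ccs * 100

def passes_thresholds(observed_mz, expected_mz, observed_ccs, expected_ccs):
    """True when both absolute errors fall within their cutoffs."""
    return (abs(ppm_error(observed_mz, expected_mz)) <= CUTOFF_MZ_ERROR_PPM and
            abs(pct_error(observed_ccs, expected_ccs)) <= CUTOFF_CCS_ERROR_PCT)
```

Tightening cutoff.CCS.Error from 1% to 0.8%, as this commit does, rejects candidates whose CCS deviates between 0.8% and 1% from the library value.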

@@ -90,6 +92,7 @@ dat$methodCE = "20V"
 dat$methodCE[grepl("_40V", dat$Replicate)] = "40V"
 
 dat = merge(dat, ms2lib, by=c("PrecursorName", "ProductName", "methodCE"), all.x = TRUE)
+dat = dat[!(is.na(dat$LibraryIntensity) & dat$ProductName != "precursor"), ] # remove fragments that were not mapped to the library
 
 # Update intensity ranks:
 dat$PrecursorName.Replicate = paste(dat$PrecursorName, dat$Replicate, sep = '.')
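The added line implements the commit's stated change: after a left join against the fragment library, fragments with no library match (and therefore no intensity) are dropped, while precursor rows are always kept. A plain-Python sketch of the same logic, for illustration only (the library key and example values are hypothetical, not from the repository):

```python
# Illustrative sketch, NOT the repository's R code: left-join against a
# fragment library, then drop unmapped fragments but keep precursor rows.

library = {("malate", "frag_115", "20V"): 1200.0}  # hypothetical intensities

rows = [
    {"PrecursorName": "malate", "ProductName": "frag_115", "methodCE": "20V"},
    {"PrecursorName": "malate", "ProductName": "frag_999", "methodCE": "20V"},  # no library match
    {"PrecursorName": "malate", "ProductName": "precursor", "methodCE": "20V"},
]

# Left join: attach LibraryIntensity where a match exists, else None (NA).
for r in rows:
    key = (r["PrecursorName"], r["ProductName"], r["methodCE"])
    r["LibraryIntensity"] = library.get(key)

# Keep a row unless it is an unmapped fragment (NA intensity, not precursor).
kept = [r for r in rows
        if not (r["LibraryIntensity"] is None and r["ProductName"] != "precursor")]
```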
@@ -191,7 +194,7 @@ for(f in featureFiles)
 
 for(k in indexes)
 {
-  x = feats[which(abs(feats$Precursor.m.z - outTab$Precursor.Mz[k]) <= 0.02 &
+  x = feats[which(abs(feats$Precursor.m.z - outTab$Precursor.Mz[k]) <= 0.3 & # Use a larger tolerance because MS-DIAL does not process saturated peaks well
                   abs(feats$RT..min. - outTab$Retention.Time[k]) <= cutoff.RT.Error), ]
   if(nrow(x) > 0)
   {
@@ -202,6 +205,9 @@ for(f in featureFiles)
 }
 
 # Apply filtering thresholds:
+outTab$Mass.Error.PPM = round(outTab$Mass.Error.PPM, 2)
+outTab$CCS.Error = round(outTab$CCS.Error, 2)
+outTab$RetentionTime.Error = round(outTab$RetentionTime.Error, 2)
 outTab$ConfidenceDescription = "None"
 outTab$ConfidenceDescription[which(abs(outTab$Mass.Error.PPM) <= cutoff.Mz.Error &
                                    abs(outTab$CCS.Error) <= cutoff.CCS.Error &
@@ -210,10 +216,11 @@ outTab$ConfidenceDescription[which(abs(outTab$Mass.Error.PPM) <= cutoff.Mz.Error
 outTab$ConfidenceDescription[which(outTab$PeakDecoderScore >= cutoff.PeakDecoderScore &
                                    outTab$ConfidenceDescription == "RT-CCS")] = "RT-CCS-DIA"
 
-# Exclude "RT-CCS" which have fragments but low combined score:
-outTab$ConfidenceDescription[which((outTab$PeakDecoderScore < cutoff.PeakDecoderScore | is.na(outTab$PeakDecoderScore)) &
-                                   outTab$CountDetectedFragments > 0 &
-                                   outTab$ConfidenceDescription == "RT-CCS")] = "None"
+# Exclude "RT-CCS" replicates which have fragments but a PeakDecoderScore below the threshold (PDS < cutoff.PeakDecoderScore):
+indexesLowPDS = which((outTab$PeakDecoderScore < cutoff.PeakDecoderScore | is.na(outTab$PeakDecoderScore)) &
+                      outTab$CountDetectedFragments > 0 &
+                      outTab$ConfidenceDescription == "RT-CCS")
+outTab$ConfidenceDescription[indexesLowPDS] = "None"
 
 outUniqueBest = outTab[which(outTab$SignalToNoise > cutoff.SignalToNoise), ]
 outUniqueBest = outUniqueBest[with(outUniqueBest, order(PrecursorName, ConfidenceDescription, -PeakDecoderScore, abs(CCS.Error))), ]
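The two hunks above implement a three-tier confidence label: "RT-CCS" when the precursor metrics pass, upgraded to "RT-CCS-DIA" when the PeakDecoder score also passes, and demoted back to "None" when fragments were detected but the score stays below (or is missing at) the cutoff. A compact sketch of that decision, in Python for illustration only (the function and its arguments are hypothetical):

```python
# Illustrative sketch, NOT the repository's R code: the per-replicate
# confidence tier implied by the ConfidenceDescription logic above.

CUTOFF_SCORE = 0.96  # example 1% FDR cutoff from the script's comments

def confidence(passes_rt_ccs, peakdecoder_score, n_fragments):
    """Assign a confidence label for one precursor/replicate."""
    if not passes_rt_ccs:
        return "None"                       # precursor metrics failed
    if peakdecoder_score is not None and peakdecoder_score >= CUTOFF_SCORE:
        return "RT-CCS-DIA"                 # DIA evidence confirms RT-CCS
    if n_fragments > 0:
        return "None"                       # fragments seen, but score too low
    return "RT-CCS"                         # precursor-only evidence
```

Note the asymmetry: a precursor with no detected fragments keeps "RT-CCS", but one whose fragments yield a sub-cutoff score is rejected outright.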
@@ -249,6 +256,10 @@ write.csv(outTab[which(outTab$PrecursorName %in% unique(outUniqueBest$PrecursorN
 write.csv(outTab[which(outTab$PrecursorName %in% unique(outUniqueBest$PrecursorName)), ],
           file = paste("Detected-molecules-all-metrics-replicates_", sampleString, "_", Sys.Date(), ".csv", sep=''), row.names = FALSE)
 
+# Save the all-metrics table for ALL targets:
+write.csv(outTab,
+          file = paste("All-targets-all-metrics-replicates_", sampleString, "_", Sys.Date(), ".csv", sep=''), row.names = FALSE)
+
 
 # ---------------------------------------------------
 manualIDs = read.csv(file = paste("Detected-metabolites-manual_", sampleString, ".csv", sep=''), stringsAsFactors = FALSE)

ScoringTraining.R

Lines changed: 4 additions & 2 deletions
@@ -115,11 +115,13 @@ targets$countBadMetrics[indexes] = targets$countBadMetrics[indexes] + 1
 
 # RT difference to precursor, flag fragment if RT difference is larger than tolerance:
 #indexes = which(abs(targets$DIA.RTdiff) > targets$Precursor.Fwhm * 0.1)
-indexes = which(abs(targets$DIA.RTdiff) > 0.1)
+#indexes = which(abs(targets$DIA.RTdiff) > 0.1)
+indexes = which(abs(targets$DIA.RTdiff) > 0.025)
 targets$countBadMetrics[indexes] = targets$countBadMetrics[indexes] + 1
 
 # FWHM difference to precursor, flag fragment if FWHM is larger than the precursor FWHM:
-indexes = which(abs(targets$DIA.FWHMdiff) > targets$Precursor.Fwhm * 2)
+#indexes = which(abs(targets$DIA.FWHMdiff) > targets$Precursor.Fwhm * 2)
+indexes = which(abs(targets$DIA.FWHMdiff) > targets$Precursor.Fwhm / 2)
 targets$countBadMetrics[indexes] = targets$countBadMetrics[indexes] + 1
 
 hist(targets$countBadMetrics)
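This hunk tightens two of the fragment-quality flags used before training (Step-4): the RT-difference tolerance drops from 0.1 to 0.025 minutes, and a fragment's FWHM may now differ from the precursor's by at most half the precursor FWHM rather than twice it. A sketch of the flag-counting idea, in Python for illustration only (the function is hypothetical; the thresholds mirror this commit's values):

```python
# Illustrative sketch, NOT the repository's R code: each fragment accumulates
# one "bad metric" flag per tolerance it exceeds, mirroring countBadMetrics.

RT_DIFF_TOL_MIN = 0.025  # max RT difference to precursor, minutes (this commit)

def count_bad_metrics(rt_diff, fwhm_diff, precursor_fwhm):
    """Count tolerance violations for one fragment."""
    bad = 0
    if abs(rt_diff) > RT_DIFF_TOL_MIN:       # RT difference to precursor
        bad += 1
    if abs(fwhm_diff) > precursor_fwhm / 2:  # FWHM difference to precursor
        bad += 1
    return bad
```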

TargetDecoyGenerator.R

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ alignmentTable = "xMSDIAL/PeakID_0_20221211731.txt"
 
 #setwd(file.path(basePathData,"data/Rhodo"))
 #alignmentTable = "xMSDIAL/PeakID_0_2022126180.txt"
+alignmentTable = "xMSDIAL/PeakID_1_20221171618.txt"
 
 
 sampleString = basename(getwd())
