Cardinal 3 provides statistical methods for both supervised and unsupervised analysis of mass spectrometry (MS) imaging experiments. Class comparison can also be performed, provided an appropriate experimental design and sample size.
Before statistical analysis, it is important to identify the statistical goal of the experiment:
Unsupervised analysis. The data has no class labels or conditions, and we are interested in exploratory analysis to discover regions of interest in the data.
Supervised analysis. The data has class labels and we want to train a statistical or machine learning model to predict the class labels of new data.
Class comparison. The data has class labels or conditions, and we want to test whether the abundance of the mass features is different between conditions.
CardinalWorkflows provides real experimental data and more detailed discussion of the statistical methods than will be covered in this brief overview.
Suppose we are exploring an unlabeled dataset, and wish to understand the structure of the data.
set.seed(2020, kind="L'Ecuyer-CMRG")
mse <- simulateImage(preset=2, dim=c(32,32), sdnoise=0.5,
peakheight=c(2,4), centroided=TRUE)
mse$design <- makeFactor(circle=mse$circle,
square=mse$square, bg=!(mse$circle | mse$square))
image(mse, "design")
Principal components analysis is an unsupervised dimension reduction technique. It reduces the data to some number of “principal components” that are a linear combination of the original mass features, where each component is orthogonal to the last, and explains as much of the variance in the data as possible.
Use PCA()
to perform PCA on a
MSImagingExperiment
.
## SpatialPCA on 30 variables and 1024 observations
## names(5): sdev, rotation, center, scale, x
## coord(2): x = 1...32, y = 1...32
## runNames(1): run0
## modelData(): Principal components (k=3)
##
## Standard deviations (1, .., k=3):
## PC1 PC2 PC3
## 7.031542 3.516199 1.092932
##
## Rotation (n x k) = (30 x 3):
## PC1 PC2 PC3
## [1,] -0.03141217 0.21197865 0.03941824
## [2,] -0.02743754 0.19152844 0.16421233
## [3,] -0.02974002 0.19314984 0.11896429
## [4,] -0.05048566 0.32818833 -0.04828145
## [5,] -0.05499438 0.34063726 -0.22523541
## [6,] -0.06129265 0.39304819 -0.18998119
## ... ... ... ...
We can see that the first 2 principal components explain most of the variation in the data.
The loadings of the components show how each feature contributes to each component.
Plotting the principal component scores against each other is a useful way of visualization the separation between data classes.
Non-negative matrix factorization is a popular alternative to PCA when the data is naturally non-negative. The main difference between PCA and NMF is that, for NMF, all of the loadings are required to be non-negative.
Use NMF()
to perform NMF on a
MSImagingExperiment
.
## SpatialNMF on 30 variables and 1024 observations
## names(4): activation, x, iter, transpose
## coord(2): x = 1...32, y = 1...32
## runNames(1): run0
## modelData(): Non-negative matrix factorization (k=3)
##
## Activation (n x k) = (30 x 3):
## C1 C2 C3
## [1,] 0.13851961 4.62861520 0.08910358
## [2,] 0.16870519 4.45529992 0.05087191
## [3,] 0.17809608 4.40834020 0.05897761
## [4,] 0.04565289 6.91493042 0.10742341
## [5,] 0.31360556 6.88478991 0.24450698
## [6,] 0.00000000 8.03062851 0.12696027
## ... ... ... ...
We can see that NMF can pick up the variation somewhat better when the data is non-negative, as is the case for mass spectra. As before, we still only need 2 components.
As with PCA, the loadings of the NMF components show how each feature contributes to each component. The NMF components can be easier to interpret as they must be non-negative.
Plotting the principal component scores against each other is a useful way of visualization the separation between data classes.
Finding other mass features colocalized with a particular image is a common task in analysis of MS imaging experiments.
Use colocalize()
to find mass features that are
colocalized with another image.
## DataFrame with 30 rows and 7 columns
## i mz cor MOC M1 M2 Dice
## <integer> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 11 1003.36 1.000000 1.000000 1.000000 1.000000 1.000000
## 2 17 1162.23 0.885656 0.941764 0.949091 0.942454 0.910156
## 3 14 1063.12 0.877213 0.937279 0.937540 0.938529 0.906250
## 4 12 1011.54 0.862906 0.928454 0.931001 0.927100 0.886719
## 5 20 1247.65 0.853436 0.921156 0.937895 0.918964 0.873047
## ... ... ... ... ... ... ... ...
## 26 10 934.117 0.317962 0.590231 0.696931 0.597839 0.603516
## 27 8 843.577 0.302436 0.588514 0.694752 0.584861 0.595703
## 28 1 513.751 0.278940 0.581294 0.690446 0.582324 0.593750
## 29 3 707.896 0.256711 0.556085 0.669893 0.566395 0.578125
## 30 2 610.262 0.240990 0.556558 0.667740 0.566149 0.574219
By default, Pearson correlation is used to rank the colocalized features. Manders overlap coefficient (MOC), colocalization coefficients (M1 and M2), and Dice scores are also provided.
Segmentation (clustering) a dataset is a useful way to summarize an MS imaging experiment and discover regions of interest within the sample.
Spatially-aware nearest shrunken centroids clustering allows simultaneous image segmentation and feature selection.
A smoothing radius r
, initial number of clusters
k
, and sparsity parameters s
must be
provided.
The larger the sparsity parameter s
, the fewer mass
features will contribute to the segmentation.
Spatial shrunken centroids may result in fewer clusters than the
initial number of clusters k
, so it is recommended to use a
value for k
that is larger than the expected number of
clusters, and allow the method to automatically choose the number of
clusters.
set.seed(2020, kind="L'Ecuyer-CMRG")
ssc <- spatialShrunkenCentroids(mse, r=1, k=3, s=c(0,6,12,18))
ssc
## ResultsList of length 4
## names(4): r=1,k=3,s=0 r=1,k=3,s=6 r=1,k=3,s=12 r=1,k=3,s=18
## model: SpatialShrunkenCentroids
## r k s weights clusters sparsity AIC BIC
## r=1,k=3,s=0 1 3 0 gaussian 3 0.00 259.0647 850.8414
## r=1,k=3,s=6 1 3 6 gaussian 3 0.23 237.2789 725.4946
## r=1,k=3,s=12 1 3 12 gaussian 3 0.56 365.2250 710.4280
## r=1,k=3,s=18 1 3 18 gaussian 3 0.58 610.2789 945.6190
Plotting the predicted cluster probabilities shows a clear segmentation into the ground truth image.
Spatial shrunken centroids calculates t-statistics for each segment and each mass feature. These t-statistics a measure of the difference between the cluster center and the global mean.
Mass features with t-statistics of zero do not contribute to the segmentation. The sign of the t-statistic indicates whether the mass feature is over- or under-expressed in the given cluster relative to the global mean.
Use topFeatures()
to rank mass features by
t-statistic.
## DataFrame with 90 rows and 6 columns
## i mz class statistic centers sd
## <integer> <numeric> <character> <numeric> <numeric> <numeric>
## 1 30 1983.41 2 25.5293 4.78972 0.983987
## 2 22 1340.73 2 21.9981 4.74217 1.035426
## 3 28 1721.92 2 21.2683 3.32087 0.857418
## 4 26 1629.57 2 20.0800 3.39765 0.839724
## 5 25 1524.34 2 19.9814 4.72049 1.053604
## ... ... ... ... ... ... ...
## 86 22 1340.73 3 -20.0457 1.68570 1.035426
## 87 11 1003.36 3 -22.0753 1.38423 0.974763
## 88 30 1983.41 3 -22.6620 1.36831 0.983987
## 89 14 1063.12 3 -22.8877 1.75624 1.031813
## 90 17 1162.23 3 -23.4116 1.17414 1.015804
Spatially-aware Dirichlet Gaussian mixture models (spatial-DGMM) is a method of image segmentation applied to each mass feature individually, rather than the dataset as a whole.
This is useful for summarizing molecular ion images, and for discovering structures that clustering using all mass features together may miss.
## SpatialDGMM on 30 variables and 1024 observations
## names(10): class, mu, sigma, ..., weights, r, k
## coord(2): x = 1...32, y = 1...32
## runNames(1): run0
## modelData(): Spatial Gaussian mixture model (k=3, channels=30)
##
## Groups: run0
##
## Parameter estimates:
## $mu
## , , 1
## 1 2 3
## run0 3.6332371 1.6076574 0.4928379
## , , ...
##
## $sigma
## , , 1
## 1 2 3
## run0 0.8967212 0.4318986 0.3117729
## , , ...
A different segmentation is fit for each mass feature.
Each image is modeled as a mixture of Gaussian distributions.
Spatial-DGMM segmentations can be especially useful for finding mass features colocalized with a region-of-interest.
When applied to a SpatialDGMM
object,
colocalize()
is able to use match scores that can have a
higher specificity than using Pearson correlation on the raw ion
images.
## DataFrame with 30 rows and 6 columns
## i mz MOC M1 M2 Dice
## <integer> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 30 1983.41 0.904051 0.980769 0.833333 0.901060
## 2 22 1340.73 0.903915 0.939929 0.869281 0.903226
## 3 25 1524.34 0.892157 0.892157 0.892157 0.892157
## 4 21 1286.70 0.875343 0.926740 0.826797 0.873921
## 5 26 1629.57 0.857069 0.944444 0.777778 0.853047
## ... ... ... ... ... ... ...
## 26 10 934.117 0.507495 0.364865 0.705882 0.481069
## 27 2 610.262 0.507243 0.349922 0.735294 0.474183
## 28 1 513.751 0.502791 0.353226 0.715686 0.473002
## 29 9 860.483 0.492366 0.363636 0.666667 0.470588
## 30 5 769.648 0.488522 0.412587 0.578431 0.481633
Classification of pixels into different known classes (e.g., cancer vs normal) based on the mass spectra is a common application for MS imaging.
set.seed(2020, kind="L'Ecuyer-CMRG")
mse2 <- simulateImage(preset=7, dim=c(32,32), sdnoise=0.3,
nrun=3, peakdiff=2, centroided=TRUE)
mse2$class <- makeFactor(A=mse2$circleA, B=mse2$circleB)
image(mse2, "class", layout=c(1,3))
When performing classification, it is important to use cross-validation so that reported accuracies are not overly optimistic.
We strongly recomend making sure that all spectra from the same experiment run belong to the same fold, to reduce predictive bias due to run effects.
Projection to latent structures (PLS), also called partial least squares, is a supervised dimension reduction technique. It can be thought of as being similar to PCA, but for classification or regression.
## SpatialCV on 30 variables and 3072 observations
## names(4): average, scores, folds, fitted.values
## coord(2): x = 1...32, y = 1...32
## runNames(3): run0, run1, run2
## modelData(): Cross validation with 3 folds
##
## Average accuracy:
## MacroRecall MacroPrecision
## ncomp=1 0.6232753 0.7585960
## ncomp=2 0.6224242 0.7417259
## ncomp=3 0.8952339 0.9413301
## ncomp=4 0.9798008 0.9851093
## ncomp=5 0.9361302 0.9472779
We can see that 4 components gives the best accuracy.
## SpatialPLS on 30 variables and 3072 observations
## names(16): coefficients, projection, residuals, ..., y.center, y.scale, algorithm
## coord(2): x = 1...32, y = 1...32
## runNames(3): run0, run1, run2
## modelData(): Partial least squares (k=4)
##
## Covariances (1, .., k=4):
## C1 C2 C3 C4
## 130998.849 25693.147 4946.968 20892.986
##
## Coefficients:
## [,1] [,2] [,3] [,4] [,5] [,6] ...
## A -0.04066857 -0.04386000 -0.02762318 -0.03173300 -0.05053358 -0.04200534 ...
## B 0.04778611 0.03924049 0.03749723 0.03776987 0.05079571 0.05061579 ...
We can plot the fitted response values to visualize the prediction.
The PLS regression coefficients can be used to find influential features.
Like PCA or NMF, it can be useful to plot the PLS scores against each other to visualize the separation between classes.
Note that orthgonal PLS (O-PLS) is also available via
method="opls"
or by using the separate OPLS()
method. Typically, both methods perform similarly, although O-PLS can
sometimes produce more easily interpretable regression coefficients.
Spatially-aware nearest shrunken centroids classification is an extension of nearest shrunken centroids that incorporates spatial information into the model.
Like in the clustering case of spatial shrunken centroids, a
smoothing radius r
must be provided along with sparsity
parameters s
.
cv_ssc <- crossValidate(spatialShrunkenCentroids, x=mse2, y=mse2$class,
r=2, s=c(0,3,6,9,12,15,18), folds=run(mse2))
cv_ssc
## SpatialCV on 30 variables and 3072 observations
## names(4): average, scores, folds, fitted.values
## coord(2): x = 1...32, y = 1...32
## runNames(3): run0, run1, run2
## modelData(): Cross validation with 3 folds
##
## Average accuracy:
## MacroRecall MacroPrecision
## r=2,s=0 0.7598126 0.8191378
## r=2,s=3 0.7780299 0.8272942
## r=2,s=6 0.7896850 0.8321012
## r=2,s=9 0.7931066 0.8321976
## r=2,s=12 0.7916638 0.8244695
## r=2,s=15 0.7834624 0.8080932
## r=2,s=18 0.7797579 0.7999117
We can see that the model with s=9
has the best
cross-validated accuracy for the data. However, it does not perform as
well as the PLS model.
## SpatialShrunkenCentroids on 30 variables and 3072 observations
## names(12): class, probability, scores, ..., transpose, weights, r
## coord(2): x = 1...32, y = 1...32
## runNames(3): run0, run1, run2
## modelData(): Nearest shrunken centroids (s=9.00) with 2 classes
##
## Priors (1, .., k=2):
## A B
## 0.5118644 0.4881356
##
## Statistics:
## A B
## [1,] 2.425884 32.879399
## [2,] . 31.930529
## [3,] 4.013525 37.035639
## [4,] 1.947710 37.887562
## [5,] 1.115648 33.651269
## [6,] 6.116447 51.341961
## ... ... ...
Again, we can plot the predicted class probabilities to visualize the prediction.
Plotting t-statistics shows most relevant peaks have a higher abundance in class “B”.
## DataFrame with 30 rows and 6 columns
## i mz class statistic centers sd
## <integer> <numeric> <character> <numeric> <numeric> <numeric>
## 1 6 795.911 B 51.3420 5.34589 0.822671
## 2 7 796.933 B 48.8423 5.12262 0.852196
## 3 9 860.483 B 47.2551 4.85455 0.672109
## 4 10 934.117 B 40.0473 3.87655 0.723529
## 5 8 843.577 B 38.0596 4.23740 0.748395
## ... ... ... ... ... ... ...
## 26 26 1629.57 B 0 2.90909 0.768785
## 27 27 1663.66 B 0 2.38996 0.762143
## 28 28 1721.92 B 0 3.58828 0.812297
## 29 29 1797.43 B 0 3.05349 0.806383
## 30 30 1983.41 B 0 3.05391 0.916468
Statistical hypothesis testing is used to determine whether the abundance of a feature is different between two or more conditions.
In order to account for additional factors like the effect of experimental runs, subject-to-subject variability, etc., this is often done most appropriately using linear models.
This example uses a simple experiment with two conditions “A” and “B”, with three replicates in each condition.
set.seed(2020, kind="L'Ecuyer-CMRG")
mse3 <- simulateImage(preset=4, npeaks=10, dim=c(32,32), sdnoise=0.3,
nrun=4, peakdiff=2, centroided=TRUE)
mse3$trt <- makeFactor(A=mse3$circleA, B=mse3$circleB)
image(mse3, "trt", layout=c(2,4))
We know from the design of the simulation that the first 5 (of 10) m/z values differ between the conditions.
## MassDataFrame with 10 rows and 4 columns
## mz circleA circleB diff
## <numeric> <numeric> <numeric> <logical>
## 1 707.896 2.78062 4.78062 TRUE
## 2 796.933 2.94643 4.94643 TRUE
## 3 860.483 2.69110 4.69110 TRUE
## 4 1011.540 2.89764 4.89764 TRUE
## 5 1063.117 2.80560 4.80560 TRUE
## 6 1173.871 2.93415 2.93415 FALSE
## 7 1340.725 2.57144 2.57144 FALSE
## 8 1497.288 2.63123 2.63123 FALSE
## 9 1524.336 2.92315 2.92315 FALSE
## 10 1983.406 2.62873 2.62873 FALSE
## mz(1): mz
Use meansTest()
to fit linear models with the most basic
summarization. The samples must be specified with samples
.
Each sample is summarized by its mean, and then a linear model is fit to
the summaries. In this case, each sample is a separate run.
Here, we specify condition
as the sole fixed effect.
Internally, the model will call either lm()
or
lme()
depending on whether any random effects are
provided.
## MeansTest of length 10
## model: lm
## i mz fixed statistic pvalue
## 1 1 707.8959 intensity ~ condition 8.12516926 0.004365491
## 2 2 796.9335 intensity ~ condition 8.45521072 0.003639991
## 3 3 860.4833 intensity ~ condition 10.55219137 0.001160504
## 4 4 1011.5401 intensity ~ condition 3.34822414 0.067277558
## 5 5 1063.1168 intensity ~ condition 0.74142667 0.389204248
## 6 6 1173.8713 intensity ~ condition 1.60483148 0.205219861
## 7 7 1340.7251 intensity ~ condition 0.05559061 0.813606015
## 8 8 1497.2877 intensity ~ condition 0.16639609 0.683334781
## 9 9 1524.3358 intensity ~ condition 9.18008397 0.002446628
## 10 10 1983.4065 intensity ~ condition 0.02200770 0.882066616
By default, the models are summarized by performing likelihood ratio tests against the null model (with no fixed effects, retaining any random effects).
Box-and-whisker plots can be used to visualize the differences (if any) between the conditions.
Use topFeatures()
to rank the results.
## DataFrame with 4 rows and 5 columns
## i mz statistic pvalue fdr
## <integer> <numeric> <numeric> <numeric> <numeric>
## 1 3 860.483 10.55219 0.00116050 0.0109137
## 2 9 1524.336 9.18008 0.00244663 0.0109137
## 3 2 796.933 8.45521 0.00363999 0.0109137
## 4 1 707.896 8.12517 0.00436549 0.0109137
We find 3 of the 5 differentially abundant features (and 1 false discovery).
Testing of SpatialDGMM
objects is also supported by
meansTest()
. The key idea here is that spatial-DGMM
segmentation captures within-sample heterogeneity, so testing between
spatial-DGMM segments is more sensitive that simply summarizing a whole
sample by its mean.
First, we must segment the data with spatialDGMM()
,
while making sure that each sample is segmented independently (by
specifying the samples as groups
).
Now use segmentationTest()
to fit the models.
## MeansTest of length 10
## model: lm
## i mz fixed statistic pvalue
## 1 1 707.8959 intensity ~ condition 9.107970720 0.002544981
## 2 2 796.9335 intensity ~ condition 8.585064604 0.003389314
## 3 3 860.4833 intensity ~ condition 9.768705320 0.001775074
## 4 4 1011.5401 intensity ~ condition 4.629620940 0.031424506
## 5 5 1063.1168 intensity ~ condition 4.325870520 0.037537209
## 6 6 1173.8713 intensity ~ condition 1.710693208 0.190895455
## 7 7 1340.7251 intensity ~ condition 0.000487530 0.982384076
## 8 8 1497.2877 intensity ~ condition 0.007429733 0.931310690
## 9 9 1524.3358 intensity ~ condition 6.881987373 0.008706870
## 10 10 1983.4065 intensity ~ condition 0.220321661 0.638794938
As with meansTest()
, the models are summarized by
performing likelihood ratio tests against the null model (with no fixed
effects, retaining any random effects).
Box-and-whisker plots can be used to visually compare the conditions.
Use topFeatures()
to rank the results.
## DataFrame with 4 rows and 5 columns
## i mz statistic pvalue fdr
## <integer> <numeric> <numeric> <numeric> <numeric>
## 1 3 860.483 9.76871 0.00177507 0.0112977
## 2 1 707.896 9.10797 0.00254498 0.0112977
## 3 2 796.933 8.58506 0.00338931 0.0112977
## 4 9 1524.336 6.88199 0.00870687 0.0217672
We find 3 of the 5 differentially abundant features (and 1 false discovery).
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## Random number generation:
## RNG: L'Ecuyer-CMRG
## Normal: Inversion
## Sample: Rejection
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] Cardinal_3.9.0 S4Vectors_0.45.1 ProtGenerics_1.39.0
## [4] BiocGenerics_0.53.2 generics_0.1.3 BiocParallel_1.41.0
## [7] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 EBImage_4.49.0 jsonlite_1.8.9
## [4] matter_2.9.0 compiler_4.4.2 BiocManager_1.30.25
## [7] Biobase_2.67.0 bitops_1.0-9 parallel_4.4.2
## [10] jquerylib_0.1.4 CardinalIO_1.5.0 png_0.1-8
## [13] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [16] R6_2.5.1 knitr_1.49 htmlwidgets_1.6.4
## [19] ontologyIndex_2.12 fftwtools_0.9-11 maketools_1.3.1
## [22] bslib_0.8.0 tiff_0.1-12 rlang_1.1.4
## [25] cachem_1.1.0 xfun_0.49 sass_0.4.9
## [28] sys_3.4.3 cli_3.6.3 digest_0.6.37
## [31] grid_4.4.2 locfit_1.5-9.10 irlba_2.3.5.1
## [34] nlme_3.1-166 lifecycle_1.0.4 evaluate_1.0.1
## [37] codetools_0.2-20 buildtools_1.0.0 RCurl_1.98-1.16
## [40] abind_1.4-8 rmarkdown_2.29 tools_4.4.2
## [43] jpeg_0.1-10 htmltools_0.5.8.1