Title: | Earth Mover's Distance for Differential Analysis of Genomics Data |
---|---|
Description: | The EMDomics algorithm is used to perform a supervised multi-class analysis to measure the magnitude and statistical significance of observed continuous genomics data between groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation). Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the distributions. This approach lacks power to identify expression differences between groups that show high levels of intra-group heterogeneity. The Earth Mover's Distance (EMD) algorithm instead computes the "work" needed to transform one distribution into another, thus providing a metric of the overall difference in shape between two distributions. Permutation of sample labels is used to generate q-values for the observed EMD scores. This package also incorporates the Komolgorov-Smirnov (K-S) test and the Cramer von Mises test (CVM), which are both common distribution comparison tests. |
Authors: | Sadhika Malladi [aut, cre], Daniel Schmolze [aut, cre], Andrew Beck [aut], Sheida Nabavi [aut] |
Maintainer: | Sadhika Malladi <[email protected]> and Daniel Schmolze <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.37.0 |
Built: | 2024-10-30 05:28:57 UTC |
Source: | https://github.com/bioc/EMDomics |
calculate_emd
, calculate_cvm
, or calculate_ks
will usually be the only functions needed, depending on the type of distribution comparison
test that is desired.
This is a main user interface to the EMDomics package, and
will usually the only function needed when conducting an analysis using the CVM
algorithm. Analyses can also be conducted with the Komolgorov-Smirnov Test using
calculate_ks
or the Earth Mover's Distance algorithm using calculate_emd
.
The algorithm is used to compare genomics data between any number of groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation).
Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the two distributions. This approach tends to give non-significant results if the two distributions are highly heterogeneous, which can be the case in many biological circumstances (e.g sensitive vs. resistant tumor samples).
The Cramer von Mises (CVM) algorithm generates a test statistic that is the sum of the squared values of the differences between two cumulative distribution functions (CDFs). As a result, the test statistic tends to overestimate the similarity between two distributions and cannot effectively handle partial matching like EMD does. However, it is one of the most commonly referenced nonparametric two-class distribution comparison tests in non-genomic contexts.
The CVM-based algorithm implemented in EMDomics has two main steps. First, a matrix (e.g. of expression data) is divided into data for each of the groups. Every possible pairwise CVM score is then computed and stored in a table. The CVM score for a single gene is calculated by averaging all of the pairwise CVM scores. Next, the labels for each of the groups are randomly permuted a specified number of times, and an CVM score for each permutation is calculated. The median of the permuted scores for each gene is used as the null distribution, and the False Discovery Rate (FDR) is computed for a range of permissive to restrictive significance thresholds. The threshold that minimizes the FDR is defined as the q-value, and is used to interpret the significance of the CVM score analogously to a p-value (e.g. q-value < 0.05 is significant.)
calculate_cvm(data, outcomes, nperm = 100, pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
calculate_cvm(data, outcomes, nperm = 100, pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector containing group labels for each of the samples provided
in the |
nperm |
An integer specifying the number of randomly permuted CVM scores to be computed. Defaults to 100. |
pairwise.p |
Boolean specifying whether the permutation-based q-values should
be computed for each pairwise comparison. Defaults to |
seq |
Boolean specifying if the given data is RNA Sequencing data and ought to be
normalized. Set to |
quantile.norm |
Boolean specifying is data should be normalized by quantiles. If
|
verbose |
Boolean specifying whether to display progress messages. |
parallel |
Boolean specifying whether to use parallel processing via
the BiocParallel package. Defaults to |
The function returns an CVMomics
object.
CVMomics
CramerVonMisesTwoSamples
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE) head(results$cvm)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE) head(results$cvm)
Calculate CVM score for a single gene
calculate_cvm_gene(vec, outcomes, sample_names)
calculate_cvm_gene(vec, outcomes, sample_names)
vec |
A named vector containing data (e.g. expression data) for a single gene. |
outcomes |
A vector of group labels for the samples. The names must correspond
to the names of |
sample_names |
A character vector with the names of the samples in |
All possible combinations of the classes are used as pairwise comparisons.
The data in vec
is divided based on class labels based on the outcomes
identifiers given. For each pairwise computation, the hist
function is
used to generate histograms for the two groups. The densities are then retrieved
and passed to CramerVonMisesTwoSamples
to compute the pairwise CVM score. The
total CVM score for the given data is the average of the pairwise CVM scores.
The cvm score is returned.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) calculate_cvm_gene(dat[1,], outcomes, colnames(dat))
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) calculate_cvm_gene(dat[1,], outcomes, colnames(dat))
This is the main user interface to the EMDomics package, and
will usually the only function needed when conducting an analysis using the EMD
algorithm. Analyses can also be conducted with the Komolgorov-Smirnov Test using
calculate_ks
or the Cramer Von Mises algorithm using calculate_cvm
.
The algorithm is used to compare genomics data between any number of groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation).
Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the two distributions. This approach tends to give non-significant results if the two distributions are highly heterogeneous, which can be the case in many biological circumstances (e.g sensitive vs. resistant tumor samples).
The Earth Mover's Distance algorithm instead computes the "work" needed to transform one distribution into another, thus capturing possibly valuable information relating to the overall difference in shape between two heterogeneous distributions.
The EMD-based algorithm implemented in EMDomics has two main steps. First, a matrix (e.g. of expression data) is divided into data for each of the groups. Every possible pairwise EMD score is then computed and stored in a table. The EMD score for a single gene is calculated by averaging all of the pairwise EMD scores. Next, the labels for each of the groups are randomly permuted a specified number of times, and an EMD score for each permutation is calculated. The median of the permuted scores for each gene is used as the null distribution, and the False Discovery Rate (FDR) is computed for a range of permissive to restrictive significance thresholds. The threshold that minimizes the FDR is defined as the q-value, and is used to interpret the significance of the EMD score analogously to a p-value (e.g. q-value < 0.05 is significant.)
Because EMD is based on a histogram binning of the expression levels, data that cannot be binned will be discarded, and a message for that gene will be printed. The most likely reason for histogram binning failing is due to uniform values (e.g. all 0s).
calculate_emd(data, outcomes, binSize = 0.2, nperm = 100, pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
calculate_emd(data, outcomes, binSize = 0.2, nperm = 100, pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector containing group labels for each of the samples provided
in the |
binSize |
The bin size to be used when generating histograms of the data for each group. Defaults to 0.2. |
nperm |
An integer specifying the number of randomly permuted EMD scores to be computed. Defaults to 100. |
pairwise.p |
Boolean specifying whether the permutation-based q-values should
be computed for each pairwise comparison. Defaults to |
seq |
Boolean specifying if the given data is RNA Sequencing data and ought to be
normalized. Set to |
quantile.norm |
Boolean specifying is data should be normalized by quantiles. If
|
verbose |
Boolean specifying whether to display progress messages. |
parallel |
Boolean specifying whether to use parallel processing via
the BiocParallel package. Defaults to |
The function returns an EMDomics
object.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE) head(results$emd)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE) head(results$emd)
Calculate EMD score for a single gene
calculate_emd_gene(vec, outcomes, sample_names, binSize = 0.2)
calculate_emd_gene(vec, outcomes, sample_names, binSize = 0.2)
vec |
A named vector containing data (e.g. expression data) for a single gene. Names ought to correspond to samples. |
outcomes |
A vector of group labels for the samples. The names must correspond
to the names of |
sample_names |
A character vector with the names of the samples in |
binSize |
The bin size to be used when generating histograms for each of the groups. |
All possible combinations of the classes are used as pairwise comparisons.
The data in vec
is divided based on class labels based on the outcomes
identifiers given. For each pairwise computation, the hist
function is
used to generate histograms for the two groups. The densities are then retrieved
and passed to emd2d
to compute the pairwise EMD score. The
total EMD score for the given data is the average of the pairwise EMD scores.
The emd score is returned.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) calculate_emd_gene(dat[1,], outcomes, colnames(dat))
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) calculate_emd_gene(dat[1,], outcomes, colnames(dat))
This is only function needed when conducting an analysis using the Komolgorov-Smirnov
algorithm. Analyses can also be conducted with the EMD algorithm using
calculate_emd
or the Cramer Von Mises (CVM) algorithm using calculate_cvm
.
The algorithm is used to compare genomics data between any number of groups. Usually the data will be gene expression values from array-based or sequence-based experiments, but data from other types of experiments can also be analyzed (e.g. copy number variation).
Traditional methods like Significance Analysis of Microarrays (SAM) and Linear Models for Microarray Data (LIMMA) use significance tests based on summary statistics (mean and standard deviation) of the two distributions. This approach tends to give non-significant results if the two distributions are highly heterogeneous, which can be the case in many biological circumstances (e.g sensitive vs. resistant tumor samples).
Komolgorov-Smirnov instead calculates a test statistic that is the maximum distance between two cumulative distribution functions (CDFs). Unlike the EMD score, the KS test statistic summarizes only the maximum difference (while EMD considers quantity and distance between all differences).
The KS algorithm implemented in EMDomics has two main steps.
First, a matrix (e.g. of expression data) is divided into data for each of the groups.
Every possible pairwise KS score is then computed and stored in a table. The KS score
for a single gene is calculated by averaging all of the pairwise KS scores. If the user
sets pairwise.p
to true, then the p-values
from the KS test are adjusted using the Benjamini-Hochberg method and stored in a table.
Next, the labels for each of the groups are randomly
permuted a specified number of times, and an EMD score for each permutation is
calculated. The median of the permuted scores for each gene is used as
the null distribution, and the False Discovery Rate (FDR) is computed for
a range of permissive to restrictive significance thresholds. The threshold
that minimizes the FDR is defined as the q-value, and is used to interpret
the significance of the EMD score analogously to a p-value (e.g. q-value
< 0.05 = significant). The q-values returned by the KS test (and adjusted for multiple
significance testing) can be compared to the permuted q-values.
calculate_ks(data, outcomes, nperm = 100, pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
calculate_ks(data, outcomes, nperm = 100, pairwise.p = FALSE, seq = FALSE, quantile.norm = FALSE, verbose = TRUE, parallel = TRUE)
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector containing group labels for each of the samples provided
in the |
nperm |
An integer specifying the number of randomly permuted EMD scores to be computed. Defaults to 100. |
pairwise.p |
Boolean specifying whether the user wants the pairwise p-values. Pairwise
p-values returned by |
seq |
Boolean specifying if the given data is RNA Sequencing data and ought to be
normalized. Set to |
quantile.norm |
Boolean specifying is data should be normalized by quantiles. If
|
verbose |
Boolean specifying whether to display progress messages. |
parallel |
Boolean specifying whether to use parallel processing via
the BiocParallel package. Defaults to |
The function returns an KSomics
object.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE) head(results$ks)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE) head(results$ks)
Calculate KS score for a single gene
calculate_ks_gene(vec, outcomes, sample_names)
calculate_ks_gene(vec, outcomes, sample_names)
vec |
A named vector containing data (e.g. expression data) for a single gene. |
outcomes |
A vector of group labels for the samples. The names must correspond
to the names of |
sample_names |
A character vector with the names of the samples in |
All possible combinations of the classes are used as pairwise comparisons.
The data in vec
is divided based on class labels based on the outcomes
identifiers given. For each pairwise computation, ks.test
is used to compute
the pairwise KS scores. The
total KS score for the given data is the average of the pairwise KS scores.
The KS score is returned.
# 100 genes, 100 samples dat <- matrix(rnorm(100000), nrow=100, ncol=1000) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:1000, sep="") # assign outcomes outcomes <- c(rep(1,500), rep(2,300), rep(3,200)) names(outcomes) <- colnames(dat) calculate_ks_gene(dat[1,], outcomes, colnames(dat))
# 100 genes, 100 samples dat <- matrix(rnorm(100000), nrow=100, ncol=1000) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:1000, sep="") # assign outcomes outcomes <- c(rep(1,500), rep(2,300), rep(3,200)) names(outcomes) <- colnames(dat) calculate_ks_gene(dat[1,], outcomes, colnames(dat))
This is the constructor for objects of class 'CVMomics'. It
is used in calculate_cvm
to construct the return value.
CVMomics(data, outcomes, cvm, cvm.perm, pairwise.cvm.table, pairwise.q.table)
CVMomics(data, outcomes, cvm, cvm.perm, pairwise.cvm.table, pairwise.q.table)
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector of group labels for each of the sample identifiers. The
names of this vector must correspond to the column names of |
cvm |
A matrix containing a row for each gene in
The row names should specify the gene identifiers for each row. |
cvm.perm |
A matrix containing a row for each gene in |
pairwise.cvm.table |
A table containing the CVM scores for each pairwise comparison for each gene. For a two-class problem, there should be only one column comparing class 1 and class 2. The row names should be gene identifiers. The column names should be in the format "<class 1> vs <class 2>" (e.g. "1 vs 2" or "A vs B"). |
pairwise.q.table |
A table containing the permutation-based q-values for each
pairwise comparison for each gene. May be |
The function combines its arguments in a list, which is assigned class 'CVMomics'. The resulting object is returned.
This is the constructor for objects of class 'EMDomics'. It
is used in calculate_emd
to construct the return value.
EMDomics(data, outcomes, emd, emd.perm, pairwise.emd.table, pairwise.q.table)
EMDomics(data, outcomes, emd, emd.perm, pairwise.emd.table, pairwise.q.table)
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector of group labels for each of the sample identifiers. The
names of this vector must correspond to the column names of |
emd |
A matrix containing a row for each gene in
The row names should specify the gene identifiers for each row. |
emd.perm |
A matrix containing a row for each gene in |
pairwise.emd.table |
A table containing the EMD scores for each pairwise comparison for each gene. For a two-class problem, there should be only one column comparing class 1 and class 2. The row names should be gene identifiers. The column names should be in the format "<class 1> vs <class 2>" (e.g. "1 vs 2" or "A vs B"). |
pairwise.q.table |
A table containing the permutation-based q-values for each
pairwise comparison for each gene. May be |
The function combines its arguments in a list, which is assigned class 'EMDomics'. The resulting object is returned.
This is the constructor for objects of class 'KSomics'. It
is used in calculate_ks
to construct the return value.
KSomics(data, outcomes, ks, ks.perm, pairwise.ks.score, pairwise.ks.q = NULL)
KSomics(data, outcomes, ks, ks.perm, pairwise.ks.score, pairwise.ks.q = NULL)
data |
A matrix containing genomics data (e.g. gene expression levels). The rownames should contain gene identifiers, while the column names should contain sample identifiers. |
outcomes |
A vector of group labels for each of the sample identifiers. The
names of this vector must correspond to the column names of |
ks |
A matrix containing a row for each gene in
The row names should specify the gene identifiers for each row. |
ks.perm |
A matrix containing a row for each gene in |
pairwise.ks.score |
A table containing the KS scores for each pairwise comparison for each gene. For a two-class problem, there should be only one column comparing class 1 and class 2. The row names should be gene identifiers. The column names should be in the format "<class 1> vs <class 2>" (e.g. "1 vs 2" or "A vs B"). |
pairwise.ks.q |
A table of the same dimensions as |
The function combines its arguments in a list, which is assigned class 'KSomics'. The resulting object is returned.
The data for the specified gene is retrieved from
cvmobj$data
. outcomes
is used to divide the data into
distributions for each group, which are then visualized as
density distributions. The calculated CVM score for the specified gene is
displayed in the plot title.
plot_cvm_density(cvmobj, gene_name)
plot_cvm_density(cvmobj, gene_name)
cvmobj |
An |
gene_name |
The gene to visualize. The name should be defined as a row
name in |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE) plot_cvm_density(results, "gene5")
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE) plot_cvm_density(results, "gene5")
The median of the randomly permuted CVM scores (i.e. the null
distribution) is plotted on the x-axis, vs. the observed CVM scores on the
y-axis. The line y=x
is superimposed.
plot_cvmnull(cvmobj)
plot_cvmnull(cvmobj)
cvmobj |
An |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "group A" = first 50, "group B" = second 50 groups <- c(rep("A",50),rep("B",50)) names(groups) <- colnames(dat) results <- calculate_cvm(dat, groups, nperm=10, parallel=FALSE) plot_cvmnull(results)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "group A" = first 50, "group B" = second 50 groups <- c(rep("A",50),rep("B",50)) names(groups) <- colnames(dat) results <- calculate_cvm(dat, groups, nperm=10, parallel=FALSE) plot_cvmnull(results)
The permuted CVM scores stored in cvmobj$cvm.perm
are
plotted as a histogram.
plot_cvmperms(cvmobj)
plot_cvmperms(cvmobj)
cvmobj |
An |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE) plot_cvmperms(results)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_cvm(dat, outcomes, nperm=10, parallel=FALSE) plot_cvmperms(results)
The data for the specified gene is retrieved from
emdobj$data
. outcomes
is used to divide the data into
distributions for each group, which are then visualized as
density distributions. The calculated EMD score for the specified gene is
displayed in the plot title.
plot_emd_density(emdobj, gene_name)
plot_emd_density(emdobj, gene_name)
emdobj |
An |
gene_name |
The gene to visualize. The name should be defined as a row
name in |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE) plot_emd_density(results, "gene5")
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE) plot_emd_density(results, "gene5")
The median of the randomly permuted EMD scores (i.e. the null
distribution) is plotted on the x-axis, vs. the observed EMD scores on the
y-axis. The line y=x
is superimposed.
plot_emdnull(emdobj)
plot_emdnull(emdobj)
emdobj |
An |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "group A" = first 50, "group B" = second 50 groups <- c(rep("A",50),rep("B",50)) names(groups) <- colnames(dat) results <- calculate_emd(dat, groups, nperm=10, parallel=FALSE) plot_emdnull(results)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "group A" = first 50, "group B" = second 50 groups <- c(rep("A",50),rep("B",50)) names(groups) <- colnames(dat) results <- calculate_emd(dat, groups, nperm=10, parallel=FALSE) plot_emdnull(results)
The permuted EMD scores stored in emdobj$emd.perm
are
plotted as a histogram.
plot_emdperms(emdobj)
plot_emdperms(emdobj)
emdobj |
An |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE) plot_emdperms(results)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_emd(dat, outcomes, nperm=10, parallel=FALSE) plot_emdperms(results)
The data for the specified gene is retrieved from
ksobj$data
. outcomes
is used to divide the data into
distributions for each group, which are then visualized as
density distributions. The calculated KS score for the specified gene is
displayed in the plot title.
plot_ks_density(ksobj, gene_name)
plot_ks_density(ksobj, gene_name)
ksobj |
An |
gene_name |
The gene to visualize. The name should be defined as a row
name in |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE) plot_ks_density(results, "gene5")
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE) plot_ks_density(results, "gene5")
The median of the randomly permuted KS scores (i.e. the null
distribution) is plotted on the x-axis, vs. the observed KS scores on the
y-axis. The line y=x
is superimposed.
plot_ksnull(ksobj)
plot_ksnull(ksobj)
ksobj |
An |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "group A" = first 50, "group B" = second 50 groups <- c(rep("A",50),rep("B",50)) names(groups) <- colnames(dat) results <- calculate_ks(dat, groups, nperm=10, parallel=FALSE) plot_ksnull(results)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "group A" = first 50, "group B" = second 50 groups <- c(rep("A",50),rep("B",50)) names(groups) <- colnames(dat) results <- calculate_ks(dat, groups, nperm=10, parallel=FALSE) plot_ksnull(results)
The permuted KS scores stored in ksobj$ks.perm
are
plotted as a histogram.
plot_ksperms(ksobj)
plot_ksperms(ksobj)
ksobj |
An |
A ggplot
object is returned. If the value is
not assigned, a plot will be drawn.
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE) plot_ksperms(results)
# 100 genes, 100 samples dat <- matrix(rnorm(10000), nrow=100, ncol=100) rownames(dat) <- paste("gene", 1:100, sep="") colnames(dat) <- paste("sample", 1:100, sep="") # "A": first 50 samples; "B": next 30 samples; "C": final 20 samples outcomes <- c(rep("A",50), rep("B",30), rep("C",20)) names(outcomes) <- colnames(dat) results <- calculate_ks(dat, outcomes, nperm=10, parallel=FALSE) plot_ksperms(results)