Title: | Integrative Statistics of alleLe Dependent Expression |
---|---|
Description: | This package provides ISoLDE, a new method for identifying imprinted genes. This method is dedicated to data arising from RNA sequencing technologies. The ISoLDE package implements the original statistical methodology described in the publication below. |
Authors: | Christelle Reynès [aut, cre], Marine Rohmer [aut], Guilhem Kister [aut] |
Maintainer: | Christelle Reynès |
License: | GPL (>= 2.0) |
Version: | 1.35.0 |
Built: | 2024-10-30 07:30:24 UTC |
Source: | https://github.com/bioc/ISoLDE |
This package provides a new method for identifying genes with allelic bias. This method is dedicated to data arising from RNA sequencing technologies. The ISoLDE package implements the original statistical methodology described in the publication below.
The ISoLDE method was motivated by limitations of existing methods in
accounting for the specificities of the data and in making the most of
biological replicates. It is based on a new criterion that uses a robust
estimate of the data variability. Variability estimation is of high
importance in statistical testing procedures because the significance of a
difference can only be assessed relative to the intrinsic variability of the data.
Two methods are available to identify allele-specific expression: one is based on bootstrap resampling, while the second uses an empirical threshold. The first is likely to give the most reliable results, but it can only be applied to data with at least three biological replicates for each reciprocal cross. Although at least three replicates are strongly recommended, the second method provides a robust solution when only two replicates are available.
Christelle Reynès,
Marine Rohmer,
Guilhem Kister
Reynès, C. et al. (2016): ISoLDE: a new method for identification of allelic imbalance. Submitted
A data.frame containing the normalized and filtered values of allele-specific
read counts, for an experiment with more than two replicates. This data.frame is
obtained with the filterT function run on the normASRcounts data.frame.
A data.frame. This data.frame is obtained with the filterT function.
Each line represents a feature (e.g. a gene or a transcript).
Each column represents the number of allele-specific sense reads from
either the paternal or maternal parent for a given biological replicate,
so that you expect to have two columns per biological replicate.
Values in the matrix are filtered and normalized (RLE method) ASR counts.
Extract from Bouschet, T. et al. (2016): In vitro corticogenesis from embryonic stem cells recapitulates the in vivo epigenetic control of imprinted gene expression. Submitted. Subset of 6062 genes (after filtering).
Bouschet, T. et al. (2016): In vitro corticogenesis from embryonic stem cells recapitulates the in vivo epigenetic control of imprinted gene expression. Submitted
filterT: a function to filter the ASR counts and produce the filteredASRcounts data.frame.
isolde_test: an example of which uses filteredASRcounts.
data(filteredASRcounts)
Filters lowly expressed genes (or transcripts) according to a data-driven threshold, before any statistical analysis. This step is not mandatory but strongly recommended.
filterT(rawASRcounts, normASRcounts, target, tol_filter = 0, bias)
rawASRcounts |
the |
normASRcounts |
the |
target |
the |
tol_filter |
a value between 0 and 100 allowing to introduce tolerance rate into filtering
step: |
bias |
The kind of allele expression bias you want to study. It must be one of “parental” or “strain”. |
Filtering is recommended in statistical analysis to avoid considering genes (or transcripts) that carry too little information, and thus to avoid an overly strong effect of the multiple testing correction.
The aim of our filtering method is to remove insufficiently quantified genes
from the analysis, that is, genes whose counts are mostly 0 or near 0 for each
replicate in at least one condition (parent, strain). To this end, the filterT
function estimates the distribution of the counts of a gene in a condition when
most of the read counts are 0 for this condition.
This distribution is used to define a threshold, and genes with fewer counts
than this threshold are removed.
The filtering step is not mandatory but strongly recommended.
A list of two data.frames:
filteredASRcounts |
This |
removedASRcounts |
This |
Each line represents a feature (e.g. a gene or a transcript).
Each column represents the number of allele-specific sense reads from either
the paternal or maternal parent for a given biological replicate, so that
you expect to have two columns per biological replicate.
The filterT output on normalized data is the typical input for isolde_test.
A minimal filtering step will always be performed when applying the
isolde_test function.
It consists of removing all genes that do not satisfy these two conditions:
- At least one of the two medians (of paternal or maternal ASR counts) is
different from 0;
- There is at least one ASR count (different from 0) in each cross.
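The two conditions above can be illustrated with a short base R sketch. This is not the package's implementation: the toy count table, the `parent` and `cross` label vectors and the helper `keep_gene` are hypothetical, and the reading of "medians of paternal or maternal ASR counts" as per-parent medians across all replicates is an assumption.

```r
# Toy ASR counts: 4 genes, 2 reciprocal crosses x 2 replicates,
# one maternal and one paternal column per replicate (hypothetical layout).
counts <- data.frame(
  c1r1_mat = c(10, 0, 0, 5),
  c1r1_pat = c(12, 0, 0, 0),
  c1r2_mat = c( 9, 0, 3, 4),
  c1r2_pat = c(11, 0, 0, 0),
  c2r1_mat = c( 8, 0, 0, 0),
  c2r1_pat = c(10, 0, 0, 0),
  c2r2_mat = c(13, 0, 0, 0),
  c2r2_pat = c( 9, 0, 0, 0),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)
# Column annotations, in the same order as the columns of `counts`
parent <- rep(c("maternal", "paternal"), times = 4)
cross  <- rep(c("cross1", "cross2"), each = 4)

keep_gene <- function(x, parent, cross) {
  # Condition 1: at least one of the two parental medians differs from 0
  med_ok <- any(tapply(x, parent, median) != 0)
  # Condition 2: at least one non-zero count in each cross
  cross_ok <- all(tapply(x != 0, cross, any))
  med_ok && cross_ok
}

keep <- apply(as.matrix(counts), 1, keep_gene, parent = parent, cross = cross)
print(keep)
```

In this toy table only geneA passes: geneB has no counts at all, geneC has both parental medians equal to 0, and geneD has no count in the second cross.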
Marine Rohmer,
Christelle Reynès
Reynès, C. et al. (2016): ISoLDE: a new method for identification of allelic imbalance. Submitted
# Loading all required data.frames
data(rawASRcounts)
data(normASRcounts)
data(target)
# Filtering genes from the ASR count data.frame in parental bias study
res_filterT <- filterT(rawASRcounts = rawASRcounts,
                       normASRcounts = normASRcounts,
                       target = target, bias = "parental")
filteredASRcounts <- res_filterT$filteredASRcounts
removedASRcounts <- res_filterT$removedASRcounts
The main function of the ISoLDE package. Performs a statistical test to identify genes with allelic bias and produces both graphical and textual outputs.
isolde_test(bias, method = "default", asr_counts, target, nboot = 5000, pcore = 75, graph = TRUE, ext = "pdf", text = TRUE, split_files = FALSE, prefix = "ISoLDE_result", outdir = "")
bias |
The kind of bias you want to study. It must be one of “parental” or “strain”. |
method |
specifies the statistical method to use for testing. It must be one of
“default” or “threshold”. Default behaviour is to adapt to the
number of replicates: when at least three biological replicates for each
reciprocal cross are available the bootstrap resampling method is used, else
the threshold method is applied. It is possible to force
|
asr_counts |
the |
target |
the target |
nboot |
specifies how many resampling steps to do for the bootstrap method.
This option is not considered if “threshold” value is set for
|
pcore |
a value between 0 and 100 (default to 75) which specifies the proportion of cores (in percent) to be used for the bootstrap method. |
graph |
if |
ext |
specifies the extension of the graphical file output (does not work if
graph = |
text |
if |
split_files |
if text = |
prefix |
specifies the prefix for all output file names (default to "ISoLDE_result"). |
outdir |
specifies the path where to write the output file(s) (default to current directory). |
Before using this function, your data should be normalized and filtered
(see the filterT function for filtering), although the function
can run with non-normalized and/or non-filtered data.
The method depends on your minimum number of replicates for each reciprocal
cross.
If only one replicate is found, the test cannot be performed and the function exits.
method = "default": If there are more than two replicates per cross, the method
takes advantage of having enough information by using bootstrap resampling
to identify genes with allelic bias.
If only two replicates are found in at least one cross, there is too little
information to obtain reliable distributions from resampling.
Genes with allelic bias are then identified using empirically defined
thresholds.
method = "threshold": The empirical method is used instead of
the bootstrap one, even if more than two replicates per cross are found.
Note that in differential RNA-seq analysis, at least three replicates are
strongly recommended, as the quality of variability estimation is a key factor in
statistical analysis.
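The decision rule described above can be summarised in a few lines of base R. This is only a sketch of the documented behaviour, not the code used inside isolde_test; the helper name `choose_method` and its argument `n_rep_per_cross` are hypothetical.

```r
# Hypothetical helper mirroring the documented rule:
#   fewer than 2 replicates in a cross -> the test cannot be run
#   exactly 2 in at least one cross    -> empirical threshold
#   at least 3 in every cross          -> bootstrap, unless "threshold" is forced
choose_method <- function(n_rep_per_cross, method = "default") {
  n_min <- min(n_rep_per_cross)
  if (n_min < 2) {
    stop("at least two replicates per reciprocal cross are required")
  }
  if (method == "threshold" || n_min == 2) "threshold" else "bootstrap"
}

choose_method(c(3, 3))                        # bootstrap
choose_method(c(3, 2))                        # threshold
choose_method(c(4, 4), method = "threshold")  # threshold (forced)
```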
More details in Reynès, C. et al. (2016): ISoLDE: a new method for
identification of allelic imbalance. Submitted
listASE |
a |
listBA |
a |
listUN |
a |
listFILT |
a |
ASE, BA and UN lists are sorted according to their criterion value.
The bootstrap resampling step is performed many times (5000 by default). Hence, the function may run for a long time when using the bootstrap method (up to several minutes).
A minimal filtering step will always be performed when applying the
isolde_test function.
It consists of removing all genes that do not satisfy these two conditions:
- At least one of the two medians (of paternal or maternal ASR counts) is
different from 0;
- There is at least one ASR count (different from 0) in each cross.
Christelle Reynès,
Marine Rohmer
Reynès, C. et al. (2016): ISoLDE: a new method for identification of allelic imbalance. Submitted
# Loading all required data.frames
data(filteredASRcounts)
data(target)
# Statistical analysis (forcing the threshold option)
isolde_res <- isolde_test(bias = "parental", method = "threshold",
                          asr_counts = filteredASRcounts, target = target,
                          ext = "pdf", prefix = "ISoLDE_test")
normASRcounts_file.txt: A tab-delimited text file containing the normalized
values of ASR counts for an experiment with more than two replicates.
normASRcounts.rda: the normASRcounts_file.txt loaded into a data.frame by the
readNormInput function.
normASRcounts_file.txt: A tab-delimited file.
normASRcounts.rda: A data.frame.
Each line represents a feature (e.g. a gene or a transcript).
Each column represents the number of allele-specific sense reads from either
the paternal or maternal parent for a given biological replicate, so that
you expect to have two columns per biological replicate.
Values in the matrix are normalized (RLE method) ASR counts.
In case of double input, columns must be in the same order in both raw and
normalized ASR counts files.
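A quick way to check the column-order requirement before filtering is to compare the column names of the two tables. The sketch below uses two small made-up data.frames (`raw` and `norm` are hypothetical stand-ins for your own raw and normalized counts) and assumes both files carry column names.

```r
# Toy raw and normalized count tables with identical column layout
raw  <- data.frame(s1_mat = c(5, 0),   s1_pat = c(7, 1),
                   row.names = c("g1", "g2"))
norm <- data.frame(s1_mat = c(4.8, 0), s1_pat = c(7.3, 1.1),
                   row.names = c("g1", "g2"))

# TRUE only if both tables have the same columns in the same order
same_layout <- identical(colnames(raw), colnames(norm))

# A table with swapped columns fails the check
swapped <- norm[, rev(colnames(norm))]
swapped_layout <- identical(colnames(raw), colnames(swapped))

print(c(same_layout = same_layout, swapped_layout = swapped_layout))
```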
The normASRcounts_file.txt file should be read and checked by the
readNormInput function.
A minimum of two biological replicates per cross is mandatory. However, we
strongly recommend using more than two replicates per cross. This enables a
better estimation of variability and the use of the bootstrap method to perform the
statistical test (see the isolde_test function).
Extract from Bouschet, T. et al. (2016): In vitro corticogenesis from embryonic stem cells recapitulates the in vivo epigenetic control of imprinted gene expression. Submitted. Subset of 6062 genes (after filtering).
Bouschet, T. et al. (2016): In vitro corticogenesis from embryonic stem cells recapitulates the in vivo epigenetic control of imprinted gene expression. Submitted
readNormInput: an example of which uses the normASRcounts file.
# normASRcounts_file.txt
normfile <- system.file("extdata", "normASRcounts_file.txt",
                        package = "ISoLDE")
normASRcounts <- readNormInput(norm_file = normfile, del = "tab",
                               rownames = TRUE, colnames = TRUE)
# normASRcounts.rda
data(normASRcounts)
rawASRcounts_file.txt: A tab-delimited text file containing the raw values of
ASR counts for an experiment with more than two replicates.
rawASRcounts.rda: the rawASRcounts_file.txt loaded into a data.frame by the
readRawInput function.
rawASRcounts_file.txt: A tab-delimited file.
rawASRcounts.rda: A data.frame.
Each line represents a feature (e.g. a gene or a transcript).
Each column represents the number of allele-specific sense reads from either
the paternal or maternal parent for a given biological replicate, so that
you expect to have two columns per biological replicate.
Values in the matrix are raw allele-specific read counts.
In case of double input, columns must be in the same order in both raw and
normalized ASR counts files.
The rawASRcounts_file.txt file should be read and checked by the
readRawInput function.
A minimum of two biological replicates per cross is mandatory. However, we
strongly recommend using more than two replicates per cross. This enables a
better estimation of variability and the use of the bootstrap method to perform the
statistical test (see the isolde_test function).
Extract from Bouschet, T. et al. (2016): In vitro corticogenesis from embryonic stem cells recapitulates the in vivo epigenetic control of imprinted gene expression. Submitted. Subset of 6062 genes (after filtering).
Bouschet, T. et al. (2016): In vitro corticogenesis from embryonic stem cells recapitulates the in vivo epigenetic control of imprinted gene expression. Submitted
readRawInput: an example of which uses the rawASRcounts file.
# rawASRcounts_file.txt
rawfile <- system.file("extdata", "rawASRcounts_file.txt",
                       package = "ISoLDE")
rawASRcounts <- readRawInput(raw_file = rawfile, del = "tab",
                             colnames = TRUE, rownames = TRUE)
# rawASRcounts.rda
data(rawASRcounts)
Checks and loads into a data.frame the input file containing
normalized allele-specific read (ASR) counts so that it can be input into
filterT and isolde_test.
readNormInput(norm_file, del = "\t", rownames = TRUE, colnames = TRUE, dec = ".")
norm_file |
A character-delimited input file containing normalized counts such as
described in |
del |
Specifies the delimiter for the input file, usually a semicolon ";", a comma "," or a tab "\t" (default: "\t"). Note: none of your data values may contain this delimiter (be especially careful with gene names). |
rownames |
Specifies if the file contains some row names to consider. Possible values: TRUE or FALSE (default: TRUE). |
colnames |
Specifies if the file contains some column names to consider. Possible values: TRUE or FALSE (default: TRUE). |
dec |
Specifies the character used in the file for decimal mark (default : "."). |
A data.frame containing normalized ASR counts from your input file.
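To see how the delimiter, decimal mark, row-name and column-name options interact, the following sketch reads a small semicolon-delimited file with a comma as decimal mark using base R's `read.table`. It is a base R analogue for illustration only and makes no claim about how readNormInput is implemented; the file content is made up.

```r
# Write a toy count file: ";" as delimiter, "," as decimal mark,
# first row = column names, first column = row names
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene;s1_mat;s1_pat",
             "g1;4,8;7,3",
             "g2;0;1,1"), tmp)

# Base R analogue of del = ";", dec = ",", rownames = TRUE, colnames = TRUE
tab <- read.table(tmp, sep = ";", dec = ",", header = TRUE, row.names = 1)
unlink(tmp)

str(tab)
```

If a gene name contained the delimiter, the corresponding line would be split into too many fields, which is why the delimiter must not appear in any data value.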
Marine Rohmer,
Christelle Reynès
normASRcounts_file.txt: the normalized ASR count file on which to
run the readNormInput function.
readRawInput: a similar function for raw (non-normalized) ASR
count files.
# character-delimited input file containing normalized ASR counts
normfile <- system.file("extdata", "normASRcounts_file.txt",
                        package = "ISoLDE")
# loading it into a data.frame using the readNormInput function
nbreadnorm <- readNormInput(norm_file = normfile, del = "tab",
                            rownames = TRUE, colnames = TRUE, dec = ".")
Checks and loads into a data.frame the input file containing
raw allele-specific read (ASR) counts so that it can be input into
filterT.
readRawInput(raw_file, del = "\t", rownames = TRUE, colnames = TRUE)
raw_file |
A character-delimited input file containing raw ASR counts such as described
in |
del |
Specifies the delimiter for the input file, usually a semicolon ";", a comma "," or a tab "\t" (default: "\t"). Note: none of your data values may contain this delimiter (be especially careful with gene names). |
rownames |
Specifies if the file contains some row names to consider. Possible values: TRUE or FALSE (default: TRUE). |
colnames |
Specifies if the file contains some column names to consider. Possible values: TRUE or FALSE (default: TRUE). |
Raw ASR counts are only required for the filtering step (with the
filterT function) when the normalized data no longer contain any 0
counts.
If you do not want to perform the filtering step, or if you still have 0 counts
in your normalized file, you do not need to load raw ASR counts.
(For simplicity, we call '0 count' any value of zero in a count file.)
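Whether raw counts are needed can be checked directly on the normalized table. The sketch below uses two made-up data.frames; `has_zero_counts` is a hypothetical helper, not a function of the package.

```r
# Hypothetical helper: does a count table still contain exact zeros?
has_zero_counts <- function(counts) {
  any(as.matrix(counts) == 0)
}

# Normalized table that kept its zeros: raw counts are not needed
norm_with_zero <- data.frame(s1_mat = c(4.8, 0), s1_pat = c(7.3, 1.1))
# Normalized table without any zero: raw counts are needed for filtering
norm_no_zero   <- data.frame(s1_mat = c(4.8, 0.2), s1_pat = c(7.3, 1.1))

needs_raw_a <- !has_zero_counts(norm_with_zero)
needs_raw_b <- !has_zero_counts(norm_no_zero)
print(c(needs_raw_a = needs_raw_a, needs_raw_b = needs_raw_b))
```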
A data.frame containing raw ASR counts from your input file.
Marine Rohmer,
Christelle Reynès
rawASRcounts_file.txt: the raw ASR count file on which to run the
readRawInput function.
readNormInput: a similar function for normalized ASR count files.
# character-delimited input file containing raw ASR counts
rawfile <- system.file("extdata", "rawASRcounts_file.txt",
                       package = "ISoLDE")
# loading it into a data.frame using the readRawInput function
nbread <- readRawInput(raw_file = rawfile, del = "tab",
                       rownames = TRUE, colnames = TRUE)
Checks and loads into a data.frame your target input file.
readTarget(target_file, asr_counts, del = "\t")
target_file |
A character-delimited text input file, containing metadata about ASR counts
files (see |
asr_counts |
The |
del |
Specifies the delimiter for the target input file, usually a semicolon ";", a comma "," or a tab "\t" (default: "\t"). Note: none of your data values may contain this delimiter. |
See target_file.txt for more details about the target_file format.
A data.frame containing the target.
Marine Rohmer,
Christelle Reynès
target_file.txt: the metadata file on which to run the
readTarget function.
# Target input file
targetfile <- system.file("extdata", "target_file.txt",
                          package = "ISoLDE")
# The data.frame containing ASR counts is also required
data(rawASRcounts)
# Load into a data.frame and check the target file
target <- readTarget(target_file = targetfile,
                     asr_counts = rawASRcounts, del = "\t")
target_file.txt: A tab-delimited file describing your input data (raw and/or
normalized allele-specific read (ASR) count file(s)).
Each line of the target file corresponds to a column of the
rawASRcounts and/or normASRcounts data.frames.
Lines of the target file MUST be in the same order as the columns in the ASR count data.
Each line contains three values, separated by a character (e.g. a tab):
sample, parent and strain (see the Details section for more information).
target.rda: The target_file.txt file loaded into a data.frame by the
readTarget function.
target_file.txt: A tab-delimited text file.
target.rda: A data.frame.
Details of the three columns:
- sample: the biological sample name. The same sample name has to appear
twice in the target file: one line for the maternal allele and one line for the
paternal allele.
- parent: the parental origin of the ASR count. Two possible values:
maternal or paternal.
- strain: the strain origin of the ASR count. Exactly two different values
have to be provided in the whole file.
The first line of the target file has to contain these column names in the same
order.
These metadata are required for both the filterT and isolde_test functions.
Dummy example:
sample,parent,strain
samp1,maternal,str1
samp1,paternal,str2
samp2,maternal,str1
samp2,paternal,str2
samp3,maternal,str1
samp3,paternal,str2
samp4,maternal,str1
samp4,paternal,str2
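The constraints listed in the Details can be checked on such a table with a few lines of base R. The sketch below rebuilds the dummy example as a data.frame and tests the documented rules (each sample appears once as maternal and once as paternal, exactly two strains overall). It is only an illustration and does not reproduce the checks performed by readTarget.

```r
# Rebuild the dummy target shown above
target <- data.frame(
  sample = rep(paste0("samp", 1:4), each = 2),
  parent = rep(c("maternal", "paternal"), times = 4),
  strain = rep(c("str1", "str2"), times = 4),
  stringsAsFactors = FALSE
)

# Each sample must appear exactly once per parental origin
per_sample <- table(target$sample, target$parent)
ok_pairs <- all(per_sample[, "maternal"] == 1 &
                per_sample[, "paternal"] == 1)

# Exactly two different strain values in the whole file
ok_strains <- length(unique(target$strain)) == 2

print(c(ok_pairs = ok_pairs, ok_strains = ok_strains))
```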
Marine Rohmer,
Christelle Reynès
readTarget: a function to load the input target file into a data.frame and
check it.