Title: | Filtering of Lowly Expressed Features |
---|---|
Description: | Filtering of lowly expressed features (e.g. genes) is a common step before performing statistical analysis, but an arbitrary threshold is generally chosen. SeqGate implements a method that rationalize this step by the analysis of the distibution of counts in replicate samples. The gate is the threshold above which sequenced features can be considered as confidently quantified. |
Authors: | Christelle Reynès [aut], Stéphanie Rialle [aut, cre] |
Maintainer: | Stéphanie Rialle <[email protected]> |
License: | GPL (>= 2.0) |
Version: | 1.17.0 |
Built: | 2024-11-30 04:34:17 UTC |
Source: | https://github.com/bioc/SeqGate |
SeqGate is a method to filter lowly expressed features (e.g. genes).
From a matrix of counts where lines correspond to features and columns to
biological samples, provided as a SummarizedExperiment object, a threshold
is computed and applied in order to filter lowly expressed features. The
threshold is computed based on the distribution of counts measured along
with zeros within replicates of the same condition. The objective of SeqGate
is to rationalize the filtering step by using the information of replicate
samples. The computed threshold corresponds to the count value below which
we can not be sure that the count can be considered different from zero.
The filtering is made by calling the applySeqGate()
function.
Christelle Reynès [email protected],
Stéphanie Rialle [email protected],
Maintainer: Stéphanie Rialle <[email protected]>
Rialle, R. et al. (2020): SeqGate: a bioconductor package to perform data-driven filtering of RNAseq datasets manuscript in preparation
# Loading of input data frame data(data_MiTF_1000genes) # Annotating conditions cond<-c("A","A","B","B","A","B") # Setting the SummarizedExperiment input rowData <- DataFrame(row.names=rownames(data_MiTF_1000genes)) colData <- DataFrame(Conditions=cond) counts_strub <- SummarizedExperiment( assays=list(counts=data_MiTF_1000genes), rowData=rowData, colData=colData) # Applying SeqGate counts_strub <- applySeqGate(counts_strub,"counts","Conditions") # Getting the matrix of kept genes after filtering keptGenes <- assay(counts_strub[rowData(counts_strub)$onFilter == TRUE,])
# Loading of input data frame data(data_MiTF_1000genes) # Annotating conditions cond<-c("A","A","B","B","A","B") # Setting the SummarizedExperiment input rowData <- DataFrame(row.names=rownames(data_MiTF_1000genes)) colData <- DataFrame(Conditions=cond) counts_strub <- SummarizedExperiment( assays=list(counts=data_MiTF_1000genes), rowData=rowData, colData=colData) # Applying SeqGate counts_strub <- applySeqGate(counts_strub,"counts","Conditions") # Getting the matrix of kept genes after filtering keptGenes <- assay(counts_strub[rowData(counts_strub)$onFilter == TRUE,])
Implements the SeqGate methods to filter lowly expressed features (e.g. genes)
applySeqGate(readCountsSE, assayName, colCond, prop0=NULL, percentile=NULL, propUpThresh=0.9)
applySeqGate(readCountsSE, assayName, colCond, prop0=NULL, percentile=NULL, propUpThresh=0.9)
readCountsSE |
|
assayName |
character string giving the name of the assay in |
colCond |
character string giving the name of the column describing the biological
conditions to which the samples belong to, in the colData DataFrame
of |
prop0 |
minimal proportion of zeros among a condition to consider that the
feature is not or lowly expressed. |
percentile |
percentile used on the 'max' distribution to determine the filtering
threshold value. |
propUpThresh |
proportion of counts to be above the threshold in at least one condition
to keep the feature. |
In order to find a theshold value to filter lowly expressed features,
SeqGate analyzes the distribution of counts found in replicates along with
zero counts. More specifically, features with a proportion of at least
prop0
zeros in one condition are selected. The distribution of counts
found in replicates of that same condition along with those zeros is
computed. The chosen threshold is the count value corresponding to the
percentile
of this distribution. Finally, features having a
proportion propUpThresh
of replicates with counts below that value
in all conditions are filtered.
If prop0
is not provided, it is set to the number of replicates minus
one divided by the max total number of replicates. For example, if one of
the condition of the experiment counts 2 replicates and the other condition
4 replicates, the proportion prop0
will be set to (4-1)/4=0.75.
If percentile
is not provided, it is set to 0.90 if at least one of
the conditions have 5 replicates or less. Otherwise, it is set to a value
comprised between 0.5 and 0.9, depending on the number or replicates.
The input SummarizedExperiment
with the following added elements:
rowData(readCountsSE)$onFilter |
vector which |
metadata(readCountsSE)$threshold |
applied filter threshold. |
Christelle Reynès [email protected],
Stéphanie Rialle [email protected]
Rialle, R. et al. (2020): SeqGate: a bioconductor package to perform data-driven filtering of RNAseq datasets (manuscript in preparation)
# Loading of input data frame data(data_MiTF_1000genes) # Annotating conditions cond<-c("A","A","B","B","A","B") # Setting the SummarizedExperiment input rowData <- DataFrame(row.names=rownames(data_MiTF_1000genes)) colData <- DataFrame(Conditions=cond) counts_strub <- SummarizedExperiment( assays=list(counts=data_MiTF_1000genes), rowData=rowData, colData=colData) # Applying SeqGate counts_strub <- applySeqGate(counts_strub,"counts","Conditions") # Getting the matrix of kept genes after filtering keptGenes <- assay(counts_strub[rowData(counts_strub)$onFilter == TRUE,])
# Loading of input data frame data(data_MiTF_1000genes) # Annotating conditions cond<-c("A","A","B","B","A","B") # Setting the SummarizedExperiment input rowData <- DataFrame(row.names=rownames(data_MiTF_1000genes)) colData <- DataFrame(Conditions=cond) counts_strub <- SummarizedExperiment( assays=list(counts=data_MiTF_1000genes), rowData=rowData, colData=colData) # Applying SeqGate counts_strub <- applySeqGate(counts_strub,"counts","Conditions") # Getting the matrix of kept genes after filtering keptGenes <- assay(counts_strub[rowData(counts_strub)$onFilter == TRUE,])
This data set is an extract from a dataset from human transciptome studied by Strub et al. (2011).
data(data_MiTF_1000genes)
data(data_MiTF_1000genes)
A data frame with 1,000 raw gene counts on the following 6 samples.
A1
a numeric vector: counts from sample A1 (condition A)
A2
a numeric vector: counts from sample A2 (condition A)
B1
a numeric vector: counts from sample B1 (condition B)
B2
a numeric vector: counts from sample B2 (condition B)
A3
a numeric vector: counts from sample A3 (condition A)
B3
a numeric vector: counts from sample B3 (condition B)
The extract counts 1,000 genes measured in 6 samples, from 2 biological conditions (cells expressing the Microphtalmia Transcription Factor (MiTF) are compared to cells in which the MiTF is repressed). The raw read counts have been retrieved from the Supplementary Materials of Dillies et al. (2012). The initial dataset counts 37,720 genes.
Dillies, M.A. et al. (2012) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinformatics 2013 Nov;14(6):671-83
Strub, T., Giuliano, S., Ye T,, et al. Essential role of microphthalmia transcription factor for DNA replication, mitosis and genomic stability in melanoma. Oncogene 2011;30:2319–32
data(data_MiTF_1000genes)
data(data_MiTF_1000genes)