Title: | YARN: Robust Multi-Condition RNA-Seq Preprocessing and Normalization |
---|---|
Description: | Expedite large RNA-Seq analyses using a combination of previously developed tools. YARN is meant to make it easier for the user to perform basic mis-annotation quality control, filtering, and condition-aware normalization. YARN leverages many Bioconductor tools and statistical techniques to account for the large heterogeneity and sparsity found in very large RNA-seq experiments. |
Authors: | Joseph N Paulson [aut, cre], Cho-Yi Chen [aut], Camila Lopes-Ramos [aut], Marieke Kuijjer [aut], John Platig [aut], Abhijeet Sonawane [aut], Maud Fagny [aut], Kimberly Glass [aut], John Quackenbush [aut] |
Maintainer: | Joseph N Paulson <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.33.0 |
Built: | 2024-12-30 07:15:12 UTC |
Source: | https://github.com/bioc/yarn |
Annotate your Expression Set with biomaRt
annotateFromBiomart(obj, genes = featureNames(obj), filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "hgnc_symbol", "chromosome_name", "start_position", "end_position"), biomart = "ensembl", dataset = "hsapiens_gene_ensembl", ...)
obj |
ExpressionSet object. |
genes |
Genes or rownames of the ExpressionSet. |
filters |
getBM filter value, see getBM help file. |
attributes |
getBM attributes value, see getBM help file. |
biomart |
BioMart database name you want to connect to. Possible database names can be retrieved with the function listMarts. |
dataset |
Dataset you want to use. To see the different datasets available within a biomaRt you can e.g. do: mart = useMart('ensembl'), followed by listDatasets(mart). |
... |
Values for useMart, see useMart help file. |
ExpressionSet object with a fuller featureData.
data(skin)
# subsetting and changing column name just for a silly example
skin <- skin[1:10,]
colnames(fData(skin)) = paste("names",1:6)
biomart <- "ENSEMBL_MART_ENSEMBL"
genes <- sapply(strsplit(rownames(skin),split="\\."),function(i)i[1])
newskin <- annotateFromBiomart(skin,genes=genes,biomart=biomart)
head(fData(newskin)[,7:11])
Bladder RNA-seq data from the GTEx consortium. V6 release.
data(bladder)
An object of class "ExpressionSet"; see ExpressionSet.
ExpressionSet object
GTEx Portal
GTEx Consortium, 2015. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), pp.648-660. (PubMed)
data(bladder)
checkMisAnnotation(bladder, "GENDER")
Check for wrong annotation of a sample using classical MDS and control genes.
checkMisAnnotation(obj, phenotype, controlGenes = "all", columnID = "chromosome_name", plotFlag = TRUE, legendPosition = NULL, ...)
obj |
ExpressionSet object. |
phenotype |
phenotype column name in the phenoData slot to check. |
controlGenes |
Name of controlGenes, e.g. 'Y' chromosome. Can specify 'all'. |
columnID |
Column name where controlGenes is defined in the featureData slot if other than 'all'. |
plotFlag |
TRUE/FALSE Whether to plot or not |
legendPosition |
Location for the legend. |
... |
Extra parameters for |
Plots a classical multi-dimensional scaling of the 'controlGenes'. Optionally returns co-ordinates.
data(bladder)
checkMisAnnotation(bladder,'GENDER',controlGenes='Y',legendPosition='topleft')
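The idea behind this check can be sketched in base R without the package. The block below is a minimal illustration, not the package's implementation: the toy count matrix, the `chrom` and `sex` vectors, and the log2(count + 1) transform are all assumptions made for the example. It restricts the data to Y-linked control genes and runs classical MDS with `stats::cmdscale`, so that samples whose annotated sex disagrees with their expression would stand out.

```r
# Toy data (hypothetical): 6 genes x 6 samples; the last two genes stand in
# for Y-linked control genes.
set.seed(1)
counts <- matrix(rpois(36, lambda = 50), nrow = 6,
                 dimnames = list(paste0("gene", 1:6), paste0("s", 1:6)))
chrom <- c("1", "2", "3", "X", "Y", "Y")   # stand-in for a chromosome_name column
sex   <- c("M", "M", "M", "F", "F", "F")   # stand-in for the annotated phenotype
counts[chrom == "Y", sex == "F"] <- 0      # female samples show no Y expression

# Keep only the control genes and log2-transform (pseudocount of 1 is an assumption).
control <- log2(counts[chrom == "Y", , drop = FALSE] + 1)

# Sample-to-sample distances, then classical MDS on one component.
d <- dist(t(control), method = "euclidean")
coords <- cmdscale(d, k = 1)

# Samples separate by sex along the first coordinate; a mislabeled sample
# would sit with the wrong cluster.
split(round(coords[, 1], 2), sex)
```

In this toy setting the male and female samples fall into two well-separated groups on the first coordinate.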
Check tissues to merge based on gene expression profile
checkTissuesToMerge(obj, majorGroups, minorGroups, filterFun = NULL, plotFlag = TRUE, ...)
obj |
ExpressionSet object. |
majorGroups |
Column name in the phenoData slot that describes the general body region or site of the sample. |
minorGroups |
Column name in the phenoData slot that describes the specific body region or site of the sample. |
filterFun |
Filter group specific genes that might disrupt PCoA analysis. |
plotFlag |
TRUE/FALSE whether to plot or not |
... |
Parameters that can go to |
CMDS plots of the majorGroups colored by the minorGroups. Optional matrix of CMDS loadings for each comparison.
checkTissuesToMerge
data(skin)
checkTissuesToMerge(skin,'SMTS','SMTSD')
Downloads the V6 GTEx release and turns it into an ExpressionSet object.
downloadGTEx(type = "genes", file = NULL, ...)
type |
Type of counts to download - default genes. |
file |
File path and name to automatically save the downloaded GTEx expression set. Saves as a RDS file. |
... |
Does nothing currently. |
Organized ExpressionSet object.
# obj <- downloadGTEx(type='genes',file='~/Desktop/gtex.rds')
This returns the raw counts, log2-transformed raw counts, or normalized expression. If normalized = TRUE then the log parameter is ignored.
extractMatrix(obj, normalized = FALSE, log = TRUE)
obj |
ExpressionSet object or matrix. |
normalized |
TRUE / FALSE, use the normalized matrix or raw counts |
log |
TRUE/FALSE log2-transform. |
matrix
data(skin)
head(yarn:::extractMatrix(skin,normalized=FALSE,log=TRUE))
head(yarn:::extractMatrix(skin,normalized=FALSE,log=FALSE))
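For readers unfamiliar with the log option, the transform can be illustrated on a plain matrix. This is a sketch under the assumption that the log2 transform uses a pseudocount of 1, as suggested by the qsmooth argument description ("log2(raw counts + 1)"); the toy matrix is hypothetical.

```r
# Hypothetical raw counts: 2 genes x 2 samples.
counts <- matrix(c(0, 1, 3, 7), nrow = 2,
                 dimnames = list(c("g1", "g2"), c("s1", "s2")))

# log2 with a pseudocount of 1 so that zero counts map to 0 instead of -Inf.
log_counts <- log2(counts + 1)
log_counts
#    s1 s2
# g1  0  2
# g2  1  3
```

The pseudocount keeps zero counts finite, which matters for the sparse matrices typical of large RNA-seq experiments.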
The main use case for this function is the removal of sex-chromosome genes. Alternatively, filter genes that are not protein-coding.
filterGenes(obj, labels = c("X", "Y", "MT"), featureName = "chromosome_name", keepOnly = FALSE)
obj |
ExpressionSet object. |
labels |
Labels of genes to filter or keep, eg. X, Y, and MT |
featureName |
FeatureData column name, eg. chr |
keepOnly |
Filter or keep only the genes with those labels |
Filtered ExpressionSet object
data(skin)
filterGenes(skin,labels = c('X','Y','MT'),featureName='chromosome_name')
filterGenes(skin,labels = 'protein_coding',featureName='gene_biotype',keepOnly=TRUE)
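The difference between the two modes of keepOnly can be shown with a plain feature table. The sketch below is illustrative only; the `features` data frame and its values are hypothetical and the logic is an assumption about the behavior described above, not the package's code.

```r
# Hypothetical feature annotation for five genes.
features <- data.frame(
  chromosome_name = c("1", "X", "Y", "MT", "2"),
  row.names = paste0("gene", 1:5)
)
labels <- c("X", "Y", "MT")

# Does each gene carry one of the labels?
has_label <- features$chromosome_name %in% labels

# keepOnly = FALSE: drop the labeled genes (e.g. remove sex-chromosome genes).
rownames(features)[!has_label]   # "gene1" "gene5"

# keepOnly = TRUE: retain only the labeled genes.
rownames(features)[has_label]    # "gene2" "gene3" "gene4"
```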
Filter genes that have less than a minimum threshold CPM for a given group/tissue
filterLowGenes(obj, groups, threshold = 1, minSamples = NULL, ...)
obj |
ExpressionSet object. |
groups |
Vector of labels for each sample or a column name of the phenoData slot for the ids to filter. Default is the column names. |
threshold |
The minimum threshold for calling presence of a gene in a sample. |
minSamples |
Minimum number of samples - defaults to half the minimum group size. |
... |
Options for cpm. |
Filtered ExpressionSet object
cpm function defined in the edgeR package.
data(skin)
filterLowGenes(skin,'SMTSD')
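The filtering rule described by the arguments above can be approximated in base R. This is a minimal sketch under stated assumptions: CPM is computed by hand as counts divided by library size times one million (rather than by calling edgeR's cpm), a gene is kept when its CPM exceeds the threshold in at least minSamples samples, and minSamples defaults to half the smallest group size. The toy counts and group labels are hypothetical.

```r
# Hypothetical counts: 3 genes x 8 samples, two tissues of 4 samples each.
counts <- rbind(
  geneA = c(10, 20,   0,  5, 12,  8, 30, 15),
  geneB = c( 0,  1,   0,  0,  0,  0,  0,  0),
  geneC = c(90, 79, 100, 95, 88, 92, 70, 85)
)
colnames(counts) <- paste0("s", 1:8)
groups <- rep(c("tissue1", "tissue2"), each = 4)

# Counts per million: scale each column by its library size.
cpm_mat <- t(t(counts) / colSums(counts)) * 1e6

threshold   <- 1
min_samples <- min(table(groups)) / 2   # half the smallest group size

# Keep genes detected above the threshold in enough samples.
keep <- rowSums(cpm_mat > threshold) >= min_samples
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)   # "geneA" "geneC"
```

Here geneB is detected in a single sample only, which is below the required two samples, so it is removed.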
The main use case for this function is the removal of missing genes.
filterMissingGenes(obj, threshold = 0)
obj |
ExpressionSet object. |
threshold |
Minimum sum of gene counts across samples – defaults to zero. |
Filtered ExpressionSet object
data(skin)
filterMissingGenes(skin)
Filter samples
filterSamples(obj, ids, groups = colnames(obj), keepOnly = FALSE)
obj |
ExpressionSet object. |
ids |
Names found within the groups labels corresponding to samples to be removed |
groups |
Vector of labels for each sample or a column name of the phenoData slot for the ids to filter. Default is the column names. |
keepOnly |
Filter or keep only the samples with those labels. |
Filtered ExpressionSet object
data(skin)
filterSamples(skin,ids = "Skin - Not Sun Exposed (Suprapubic)",groups="SMTSD")
filterSamples(skin,ids=c("GTEX-OHPL-0008-SM-4E3I9","GTEX-145MN-1526-SM-5SI9T"))
This function provides a wrapper for several normalization methods. Currently it wraps only qsmooth and quantile normalization, returning a log-transformed normalized matrix. qsmooth normalizes samples in a condition-aware manner.
normalizeTissueAware(obj, groups, normalizationMethod = c("qsmooth", "quantile"), ...)
obj |
ExpressionSet object |
groups |
Vector of labels for each sample or a column name of the phenoData slot for the ids to filter. Default is the column names |
normalizationMethod |
Choice of 'qsmooth' or 'quantile' |
... |
Options for |
ExpressionSet object with an assayData called normalizedMatrix
The function qsmooth comes from the qsmooth package, currently available on GitHub under user 'kokrah'.
data(skin)
normalizeTissueAware(skin,"SMTSD")
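To make the 'quantile' option more concrete, the sketch below implements plain quantile normalization in base R on a toy log-expression matrix. It is an illustration of the general technique only: the helper `quantile_normalize` is hypothetical, it is not the routine the package calls, and it does not reproduce qsmooth, which additionally weights group-specific quantiles against the overall reference.

```r
# Plain quantile normalization: force every column (sample) to share the
# same empirical distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its column
  sorted <- apply(x, 2, sort)                         # each column sorted ascending
  ref    <- rowMeans(sorted)                          # reference distribution
  out    <- apply(ranks, 2, function(r) ref[r])       # map ranks back to reference values
  dimnames(out) <- dimnames(x)
  out
}

# Hypothetical log-expression values: 4 genes x 3 samples.
log_expr <- matrix(c(5, 2, 3, 4,
                     4, 1, 4, 2,
                     3, 4, 6, 8),
                   nrow = 4,
                   dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

normalized <- quantile_normalize(log_expr)

# After normalization every sample has identical sorted values.
apply(normalized, 2, sort)
```

A condition-aware variant could apply such a step within each group or blend group-level and global references; which of these a given method does should be checked against its own documentation.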
This function plots the MDS coordinates for the "n" features of interest, potentially uncovering batch effects or feature relationships.
plotCMDS(obj, comp = 1:2, normalized = FALSE, distFun = dist, distMethod = "euclidian", n = NULL, samples = TRUE, log = TRUE, plotFlag = TRUE, ...)
obj |
ExpressionSet object or matrix. |
comp |
Which components to display. |
normalized |
TRUE / FALSE, use the normalized matrix or raw counts. |
distFun |
Distance function, default is dist. |
distMethod |
The distance measure to be used. This must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". Any unambiguous substring can be given. |
n |
Number of features to make use of in calculating your distances. |
samples |
Perform on samples or genes. |
log |
TRUE/FALSE log2-transform raw counts. |
plotFlag |
TRUE/FALSE whether to plot or not. |
... |
Additional plot arguments. |
coordinates
data(skin)
res <- plotCMDS(skin,pch=21,bg=factor(pData(skin)$SMTSD))
# library(calibrate)
# textxy(X=res[,1],Y=res[,2],labs=rownames(res))
Plots the density of the columns of a matrix. Wrapper for matdensity.
plotDensity(obj, groups = NULL, normalized = FALSE, legendPos = NULL, ...)
obj |
ExpressionSet object |
groups |
Vector of labels for each sample or a column name of the phenoData slot for the ids to filter. Default is the column names. |
normalized |
TRUE / FALSE, use the normalized matrix or log2-transformed raw counts |
legendPos |
Legend title position. If NULL (the default), no legend is created. |
... |
Extra parameters for matdensity. |
A density plot for each column in the ExpressionSet object colored by groups
data(skin)
filtData <- filterLowGenes(skin,"SMTSD")
plotDensity(filtData,groups="SMTSD",legendPos="topleft")
# to remove the legend
plotDensity(filtData,groups="SMTSD")
This function plots a heatmap of the gene expressions for the "n" features of interest.
plotHeatmap(obj, n = NULL, fun = stats::sd, normalized = TRUE, log = TRUE, ...)
obj |
ExpressionSet object or matrix. |
n |
Number of features to make use of in plotting heatmap. |
fun |
Function to sort genes by, default |
normalized |
TRUE / FALSE, use the normalized matrix or raw counts. |
log |
TRUE/FALSE log2-transform raw counts. |
... |
Additional plot arguments for |
coordinates
data(skin)
tissues <- pData(skin)$SMTSD
plotHeatmap(skin,normalized=FALSE,log=TRUE,trace="none",n=10)
# Even prettier (requires the RColorBrewer package)
library(RColorBrewer)
heatmapColColors <- brewer.pal(12,"Set3")[as.integer(factor(tissues))]
heatmapCols <- colorRampPalette(brewer.pal(9, "RdBu"))(50)
plotHeatmap(skin,normalized=FALSE,log=TRUE,trace="none",n=10,
            col = heatmapCols,ColSideColors = heatmapColColors,cexRow = 0.6,cexCol = 0.6)
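The role of the n and fun arguments can be illustrated independently of the plotting step. The sketch below, which uses a hypothetical random matrix, ranks features by a summary function (standard deviation here, matching the default fun = stats::sd) and keeps the top n. It is an assumption about how such a selection typically works, not a copy of the package's internals.

```r
# Hypothetical log-expression matrix: 10 genes x 5 samples.
set.seed(42)
mat <- matrix(rnorm(50), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("s", 1:5)))
mat["gene3", ] <- c(-10, 10, -10, 10, 0)   # make one gene clearly the most variable

n   <- 3
fun <- stats::sd

# Score every gene, then keep the n highest-scoring ones.
scores <- apply(mat, 1, fun)
top    <- order(scores, decreasing = TRUE)[seq_len(n)]
sub    <- mat[top, , drop = FALSE]

rownames(sub)[1]   # "gene3"
```

The resulting sub-matrix is what would then be passed to a heatmap function; swapping fun for another statistic such as stats::mad changes which features are shown.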
This function was modified from github user kokrah.
qsmooth(obj, groups, norm.factors = NULL, plot = FALSE, window = 0.05, log = TRUE)
obj |
for counts use log2(raw counts + 1), for MA use log2(raw intensities) |
groups |
groups to which samples belong (character vector) |
norm.factors |
scaling normalization factors |
plot |
plot weights? (default=FALSE) |
window |
window size for running median (a fraction of the number of rows of exprs) |
log |
Whether or not the data should be log transformed before normalization, TRUE = YES. |
Normalized expression
Kwame Okrah's qsmooth R package
data(skin)
head(yarn:::qsmooth(skin,groups=pData(skin)$SMTSD))
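The window argument is expressed as a fraction of the number of rows, so it has to be converted into a concrete width before a running median can be applied. The sketch below shows one plausible conversion using `stats::runmed`, which requires an odd window width. The rounding rule, the endrule choice, and the toy weight vector are assumptions for illustration and are not taken from the package source.

```r
# Hypothetical per-quantile weights: a step from 0 to 1 with one isolated spike.
n_rows  <- 200
weights <- c(rep(0, 100), rep(1, 100))
weights[50] <- 1   # spike that smoothing should remove

window <- 0.05     # fraction of the number of rows

# Convert the fraction to a window width; runmed() needs an odd integer.
k <- floor(window * n_rows)
if (k %% 2 == 0) k <- k + 1

smoothed <- stats::runmed(weights, k = k, endrule = "constant")

smoothed[50]   # the isolated spike is smoothed back to 0
```

A larger window gives smoother weights at the cost of blurring genuine transitions between quantile ranges.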
This function was directly borrowed from github user kokrah.
qstats(exprs, groups, window)
exprs |
for counts use log2(raw counts + 1), for MA use log2(raw intensities) |
groups |
groups to which samples belong (character vector) |
window |
window size for running median as a fraction of the number of rows of exprs |
list of statistics
Kwame Okrah's qsmooth R package
Compute quantile statistics
Skin RNA-seq data from the GTEx consortium. V6 release. Random selection of 20 skin samples: 13 are fibroblast cells, 5 are sun-exposed skin, and 2 are sun-unexposed skin.
data(skin)
An object of class "ExpressionSet"; see ExpressionSet.
ExpressionSet object
GTEx Portal
GTEx Consortium, 2015. The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans. Science, 348(6235), pp.648-660. (PubMed)
data(skin)
checkMisAnnotation(skin,"GENDER")