Title: | Oscope - A statistical pipeline for identifying oscillatory genes in unsynchronized single cell RNA-seq |
---|---|
Description: | Oscope is a statistical pipeline developed to identifying and recovering the base cycle profiles of oscillating genes in an unsynchronized single cell RNA-seq experiment. The Oscope pipeline includes three modules: a sine model module to search for candidate oscillator pairs; a K-medoids clustering module to cluster candidate oscillators into groups; and an extended nearest insertion module to recover the base cycle order for each oscillator group. |
Authors: | Ning Leng |
Maintainer: | Ning Leng <[email protected]> |
License: | Artistic-2.0 |
Version: | 1.37.0 |
Built: | 2024-12-18 03:25:19 UTC |
Source: | https://github.com/bioc/Oscope |
Oscope is a statistical pipeline developed to identifying and recovering the base cycle profiles of oscillating genes in an unsynchronized single cell RNA-seq experiment. The Oscope pipeline includes three modules: a sine model module to search for candidate oscillator pairs; a K-medoids clustering module to cluster candidate oscillators into groups; and an extended nearest insertion module to recover the base cycle order for each oscillator group.
Package: | Oscope |
Type: | Package |
Version: | 0.99.0 |
Date: | 2015-7-27 |
License: | Artistic-2.0 |
Ning Leng
Maintainer: Ning Leng <[email protected]>
Leng et al. Oscope - A statistical pipeline for identifying oscillatory genes in unsynchronized single cell RNA-seq, accepted
Calculate absolute correlations among gene pairs
AbsCor(DataIn, method="pearson", diagNA=TRUE)
AbsCor(DataIn, method="pearson", diagNA=TRUE)
DataIn |
input data, gene-by-sample matrix |
method |
"pearson" or "spearman"; default is "pearson" |
diagNA |
whether replace diagonal values to NA's |
Output is a gene-by-gene matrix; the i, j th entry shows the absolute correlation of the ith and jth gene.
Ning Leng
AbsCor(matrix(rnorm(10),ncol=5))
AbsCor(matrix(rnorm(10),ncol=5))
Calculate estimated mean and variance of RNA-Seq data
CalcMV(Data, Sizes=NULL, NormData=FALSE, MeanCutLow=100, MeanCutHigh=NULL, ApproxVal=10^-6, Plot=TRUE)
CalcMV(Data, Sizes=NULL, NormData=FALSE, MeanCutLow=100, MeanCutHigh=NULL, ApproxVal=10^-6, Plot=TRUE)
Data |
input data matrix; it should be a gene-by-sample or isoform-by sample matrix |
Sizes |
The library size factor for each sample. the number of values in Sizes is expected to be the same as the number of columns of Data. The library size factor will be estimated using the median normalization method implemented in EBSeq if Sizes is specified as NULL. |
NormData |
whether the data is already normalized. If NormData=TRUE, the specification of Sizes will be ignored and no normalization will be applied. |
MeanCutLow , MeanCutHigh
|
we suggests the users to apply Oscope on genes with high mean and high variance. By default, MeanCutLow is specified as 100, consequently only genes with mean > 100 will be used. The CalcMV function will fit a linear regression on log(variance)~log(mean) on these genes. Genes with variance above this line are considered as the high mean high variance genes. The upper bound of mean may be specified using MeanCutHigh. If both are specified as NULL, all of the genes will be considered when fitting the regression. |
ApproxVal |
Default is 10^-6. It is used to approximate the estimate of parameter q for genes/isoforms whose estimated variance is less than estimated mean. q will be estimated using 1-ApproxVal |
Plot |
if Plot = T, a mean-variance plot will be shown. The fitted line will be shown and the selected genes will be marked in green. |
Output is a list with 6 sublists : Mean: estimated means of genes/isoforms; Var: estimated variances; Median: estimated medians; GeneToUse: the high mean high variance genes (suggested input for Oscope); Q: estimated q's (without apporximation); Q_mdf: estimated q's with approximations; Phi_mdf: estimated overdispersion parameter (phi), with approximations.
Ning Leng
exp=matrix(rnorm(100,1000,10),ncol=10) rownames(exp)=paste0("g",1:10) CalcMV(exp)
exp=matrix(rnorm(100,1000,10),ncol=10) rownames(exp)=paste0("g",1:10) CalcMV(exp)
Flag gene clusters with small within-cluster phase differences and/or small within-cluster sine scores
FlagCluster(SineRes, KMRes, Data, qt, thre=pi/4, qtincluster=.5, qtinpermu=.9 ,Seed=1)
FlagCluster(SineRes, KMRes, Data, qt, thre=pi/4, qtincluster=.5, qtinpermu=.9 ,Seed=1)
SineRes |
output of OscopeSine() function |
KMRes |
output of KMRes() function |
Data |
a gene-by-sample (isoform-by-sample) matrix indicating the rescaled expression of genes/isoforms. all values should be between [-1, 1]. |
qt , thre
|
Define a gene pair's linear score as min(eta, pi-eta), in which eta is defined as phase shift mod pi. A cluster will be flagged if the qt th quantile of within-cluster linear score is less than thre. |
qtincluster , qtinpermu
|
To define clusters with small within-cluster sine scores, for each cluster we generate permuted data of these genes (different cell permutation for each gene). We calculate the within-cluster sine scores within the cluster of permuted genes, then infer whether the sine scores in the cluster of interest are greater than those generated by the permuted genes. A cluster will be flagged if its qtincluster th quantile in the original data is less than its qtinpermu th quantile in permuted data. |
Seed |
seed |
Output: RemoveID: a vector of cluster numbers that are flagged; SineCompreList: sine score and permuted sine score for each cluster; LinearList: linear score of each cluster
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- OscopeSine(Dat) res2 <- OscopeKM(res1, quan=.8, maxK=5) res <- FlagCluster(res1, res2, Dat)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- OscopeSine(Dat) res2 <- OscopeKM(res1, quan=.8, maxK=5) res <- FlagCluster(res1, res2, Dat)
Format SinFun outputs from lists to matrix
FormatSineOut(result, DataInSc, ShiftRg=pi/4)
FormatSineOut(result, DataInSc, ShiftRg=pi/4)
result |
Output from SineFun |
DataInSc |
a gene-by-sample (isoform-by-sample) matrix indicating the rescaled expression of two genes/isoforms. all values should be bettwen [-1, 1]. |
ShiftRg |
phase shift cutoff. |
Output is a list with 4 sublists, each shows a N-by-N matrix, in which#' N is the total number of genes (isoforms). SimiMat: similarity matrix (sine scores); the sine scores are calculated by -log10(epsilon^2). DiffMat: dissimilarity matrix; shown are epsilon^2 for each gene pair. ShiftMat: optimal phase shift estimate for each pair of genes.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) DataInSc <- rbind(aa,bb,cc) NumGene <- nrow(DataInSc) Res <- sapply(1:(NumGene-1),function(i)SineFun(DataInSc, i),simplify=FALSE) Out <- FormatSineOut(Res, DataInSc)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) DataInSc <- rbind(aa,bb,cc) NumGene <- nrow(DataInSc) Res <- sapply(1:(NumGene-1),function(i)SineFun(DataInSc, i),simplify=FALSE) Out <- FormatSineOut(Res, DataInSc)
Search for the optimal sample order by using the Extended Nearest Insertion
ImpShift(Data, Seq=NULL, NChun=4, RdmStart=FALSE, Ndg=3)
ImpShift(Data, Seq=NULL, NChun=4, RdmStart=FALSE, Ndg=3)
Data |
gene-by-sample matrix or isoform-by-sample matrix.It should be rescaled to values bwteen [-1,1]. |
Seq |
NULL or a vector indicates the sample order. if specified, the samples will be first reordered by this vector. |
NChun |
number of starting points for polynomial fitting. |
RdmStart |
whether the start points are randomly selected. |
Ndg |
degree of polynomial. |
This function performs the extended nearest insertion (ENI). The ENI algorithm searchs for the optimal sample order which minimizes the MSE of sliding polynomial regression (SPR). This function will call PipeShiftCDF() function, which fits SPR to each row of the data. For each gene/isoform, SPR fits NChun polynomial curves with different starting points (samples). The samples with smaller order than the start point will be appended to follow the last sample when fitting. So each fitting consider same number of samples. If RdmStart = TRUE, the start points are randomly selected. Otherwise they are evenly sampled along the sample order. The aggregated MSE of a fit (using a specific start point) is defined as the summation of the MSEs of all genes/isoforms considered here. The MSE of the SPR is defined as the largest aggregated MSE across fits using different start points. The output returns the optimal order which provides the smallest SPR MSE.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- ImpShift(rbind(aa,bb,cc), NChun=2)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- ImpShift(rbind(aa,bb,cc), NChun=2)
Run Extended Nearest Insertion and 2-opt on a gene cluster identified by OscopeKM function
NISFun(ClusterList, DataIn, i, Ndg=3, NChun=4, RdmStart=FALSE, N=20000, NCThre=1000)
NISFun(ClusterList, DataIn, i, Ndg=3, NChun=4, RdmStart=FALSE, N=20000, NCThre=1000)
ClusterList |
a list of gene clusters. Each sublist contains a vector of gene names. |
DataIn |
gene-by-sample matrix or isoform-by-sample matrix.It should be rescaled to values bwteen [-1,1]. |
i |
the cluster of interest. If the second cluster in ClusterList is of interest, specify i=2. |
Ndg |
degree of polynomial. |
NChun |
number of starting points for polynomial fitting. |
RdmStart |
whether the start points are randomly selected. |
N , NCThre
|
The 2-opt algorithm will stop if N iterations has been performed or if the optimal order remains unchanged for over NCThre iterations. |
This function performs the extended nearest insertion (ENI) and 2-opt algorithm to a particular cluster identified by OscopeKM function. The ENI algorithm searchs for the optimal sample order which minimizes the MSE of sliding polynomial regression (SPR). This function will call PipeShiftCDF() function, which fits SPR to each row of the data. For each gene/isoform, SPR fits NChun polynomial curves with different starting points (samples). The samples with smaller order than the start point will be appended to follow the last sample when fitting. So each fitting consider same number of samples. If RdmStart = TRUE, the start points are randomly selected. Otherwise they are evenly sampled along the sample order. The aggregated MSE of a fit (using a specific start point) is defined as the summation of the MSEs of all genes/isoforms considered here. The MSE of the SPR is defined as the largest aggregated MSE across fits using different start points. The output of PipeShiftCDF() returns the optimal order which provides the smallest SPR MSE. The 2-opt algorithm is then applied to improve the optimal order searching of the ENI. In each iteration, 2-opt algorithm will randomly choose two points (samples), the flip the samples between these two points. The new order will be adapted if it provides smaller SPR MSE. The output returns the optimal order for the cluster of interest.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- NISFun(list(c("aa","bb"),"cc"), rbind(aa,bb,cc),i=1, NChun=2, N=50)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- NISFun(list(c("aa","bb"),"cc"), rbind(aa,bb,cc),i=1, NChun=2, N=50)
Rescale the gene/isoform expression matrix
NormForSine(Data, qt1=.05, qt2=.95)
NormForSine(Data, qt1=.05, qt2=.95)
Data |
input gene-by-sample matrix or isoform-by-sample matrix |
qt1 , qt2
|
thresholds for outlier adjustment. For each gene/isoform, values <= qt1 th quantile (>= qt2 th quantile) will be pushed to qt1 th quantile (qt2 th quantile) prior to the scaling. default values are 0.05 and 0.95. |
The output will be a gene-by-sample or isoform-by-sample matrix. For each gene/isoform, the expressions will be scaled linearly to [-1,1]
Ning Leng
NormForSine(matrix(rnorm(10), nrow=2))
NormForSine(matrix(rnorm(10), nrow=2))
Run the 2-opt algorithm to improve the optimal order searching of the Extended Nearest Insertion
Opt2Shift(Data,N=20000,Seq,Ndg=3,NChun=4, NCThre=1000, RdmStart=FALSE)
Opt2Shift(Data,N=20000,Seq,Ndg=3,NChun=4, NCThre=1000, RdmStart=FALSE)
Data |
gene-by-sample matrix or isoform-by-sample matrix. It should be rescaled to values bwteen [-1,1]. |
N , NCThre
|
The 2-opt algorithm will stop if N iterations has been performed or if the optimal order remains unchanged for over NCThre iterations. |
Seq |
a vector indicates the sample order obtained from the ENI. |
Ndg |
degree of polynomial. |
NChun |
number of starting points for polynomial fitting. |
RdmStart |
whether the start points are randomly selected. |
This function performs the the 2-opt algorithm to improve the optimal order searching of the Extended Nearest Insertion (ENI). In each iteration, the function will randomly choose two points (samples), the flip the samples between these two points. The new order will be adapted if it provides smaller SPR MSE. The output returns the optimal order and its SPR MSE.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- ImpShift(rbind(aa,bb,cc), NChun=2) res2 <- Opt2Shift(rbind(aa,bb,cc), NChun=2, N=50, Seq=res)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- ImpShift(rbind(aa,bb,cc), NChun=2) res2 <- Opt2Shift(rbind(aa,bb,cc), NChun=2, N=50, Seq=res)
Search for the optimal sample order for different gene clusters
OscopeENI(KMRes, Data, ClusterUse=NULL, Ndg=3, NChun=4, RdmStart=FALSE, N=20000, NCThre=1000, parallel=FALSE, parallelParam=NULL)
OscopeENI(KMRes, Data, ClusterUse=NULL, Ndg=3, NChun=4, RdmStart=FALSE, N=20000, NCThre=1000, parallel=FALSE, parallelParam=NULL)
KMRes |
output of OscopeKM() function. |
Data |
gene-by-sample matrix or isoform-by-sample matrix. It should be rescaled to values bwteen [-1,1]. |
ClusterUse |
a vector indicating what clusters are of interest. For example, by setting ClusterUse = c(1,2,3), only the top 3 clusters will be considered while recovering the base cycle order. If ClusterUse=NULL, all clusters will be used. |
Ndg |
degree of polynomial. |
NChun |
number of starting points for polynomial fitting. |
RdmStart |
whether the start points are randomly selected. |
N , NCThre
|
The 2-opt algorithm will stop if N iterations has been performed or if the optimal order |
parallel |
whether apply parallel computing. if it is TRUE, BiocParallel will be called. |
parallelParam |
a SnowParam object to specify the clusters. If it is NULL, the default will be set as SnowParam(workers = 5, type = "SOCK") remains unchanged for over NCThre iterations. |
This function performs the extended nearest insertion (ENI) and 2-opt algorithm to all clusters (or a subset of picked clusters) identified by OscopeKM function. The function will recover independent orders to each of the clusters. For each cluster, the ENI algorithm will be applied to search for the optimal sample order which minimizes the MSE of sliding polynomial regression (SPR). This function will call PipeShiftCDF() function, which fits SPR to expression of each gene/isoform within a cluster. For each gene/isoform, SPR fits NChun polynomial curves with different starting points (samples). The samples with smaller order than the start point will be appended to follow the last sample when fitting. So each fitting consider same number of samples. If RdmStart = TRUE, the start points are randomly selected. Otherwise they are evenly sampled along the sample order. The aggregated MSE of a fit (using a specific start point) is defined as the summation of the MSEs of all genes/isoforms considered here. The MSE of the SPR is defined as the largest aggregated MSE across fits using different start points. The output of PipeShiftCDF() returns the optimal order which provides the smallest SPR MSE. The 2-opt algorithm will then be applied to improve the optimal order searching of the ENI. In each iteration, the 2-opt algorithm will randomly choose two points (samples), the flip the samples between these two points. The new order will be adapted if it provides smaller SPR MSE. The output returns the optimal order for each cluster of interest. It is a list with multiple sublists, in which each sublist includes the recovered order of the corresponding cluster in ClusterUse. If ClusterUse is not specified, the k th sublist shows the recovered order in KMRes
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) dd <- sin(seq(1.2,2.2,.1)) res <- OscopeENI(list(c1=c("aa","bb"),c2=c("cc","dd")), rbind(aa,bb,cc,dd), NChun=2, N=50)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) dd <- sin(seq(1.2,2.2,.1)) res <- OscopeENI(list(c1=c("aa","bb"),c2=c("cc","dd")), rbind(aa,bb,cc,dd), NChun=2, N=50)
Simulated gene expression to evaluate Oscope. 600 genes and 30 cell are simulated. The expression mean of the genes are randomly simulated in the range of 10-10000.
GeneExampleData is a matrix with 600 genes (rows) and 30 cells (columns).
data(OscopeExampleData) str(OscopeExampleData)
data(OscopeExampleData) str(OscopeExampleData)
Oscope K medoid module
OscopeKM(SineRes, quan=.95,cut=NULL,maxK=NULL,minSize=0, maxSize=200, fixK=NULL, rawscale=TRUE)
OscopeKM(SineRes, quan=.95,cut=NULL,maxK=NULL,minSize=0, maxSize=200, fixK=NULL, rawscale=TRUE)
SineRes |
output of OscopeSine function. |
quan |
only gene pairs with similarity score >= quan th quantile will be considered in the clustering analyses. Default is 0.95. |
cut |
pre-defined cutoff. Gene pairs with similarity score >= cut will be considered in cluster analyses. If cut is defined, quan will be ignored. |
maxK |
max number of clusters to consider (scan). if numbC=NULL, it will be calculated as [number of gene considered]/10 |
minSize , maxSize
|
Only clusters with minSize<= cluster size <= maxSize are reported in output. |
fixK |
if fixK is specified, the k-medoids algorithm will be applied with fixK clusters. |
rawscale |
Recall the input is the similarity matrix (-log10(distance from the sine model)). the k-medoids clustering will be applied using (-Input) as distance. If rawscale is defined as TRUE, the k-medoids clustering will be applied using -10^Input as distance. |
OscopeKM() calls scanK() function, which runs k-medoid clustering with varying number of clusters (k). The k is varied from 2 to maxK. The input should be the output of OscopeSine() function. scanK() function will cluster genes in gene pairs with high similarity score (the threshold can be defined using parameter quan). To select the top genes, the function first calculate the max similarity score for each gene, then select the genes with high max score.
The output object shows members in each cluster. clusters are sorted by median similarity score within cluster.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- OscopeSine(Dat) res2 <- OscopeKM(res1, quan=.8, maxK=5)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- OscopeSine(Dat) res2 <- OscopeKM(res1, quan=.8, maxK=5)
Apply sine model on the full set of genes or isoforms
OscopeSine(DataInSc, parallel=FALSE, parallelParam=NULL)
OscopeSine(DataInSc, parallel=FALSE, parallelParam=NULL)
DataInSc |
a gene-by-sample (isoform-by-sample) matrix indicating the rescaled expression of two genes/isoforms. all values should be bettwen [-1, 1]. |
parallel |
whether apply parallel computing. if it is TRUE, BiocParallel will be called. |
parallelParam |
a SnowParam object to specify the clusters. If it is NULL, the default will be set as SnowParam(workers = 5, type = "SOCK") remains unchanged for over NCThre iterations. |
Output is a list with 4 sublists, each shows a N-by-N matrix, in which N is the total number of genes (isoforms). SimiMat: similarity matrix (sine scores); the sine scores are calculated by -log10(epsilon^2). DiffMat: dissimilarity matrix; shown are epsilon^2 for each gene pair. ShiftMat: optimal phase shift estimate for each pair of genes.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) OscopeSine(rbind(aa,bb,cc))
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) OscopeSine(rbind(aa,bb,cc))
Define sine scroe cutoff using permuted data
PermuCut(Data, NumPermu=1000)
PermuCut(Data, NumPermu=1000)
Data |
a gene-by-sample (isoform-by-sample) matrix indicating the rescaled expression of genes/isoforms. all values should be between [-1, 1]. |
NumPermu |
number of permuted genes to generate. |
Output contains a vector of numbers. Each number presents max sine score of a given permuted gene.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- PermuCut(Dat,100)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- PermuCut(Dat,100)
Calculate residual of polynomial fit
PipeR(Data,Ndg=3,Method="Poly")
PipeR(Data,Ndg=3,Method="Poly")
Data |
gene-by-sample matrix or isoform-by-sample matrix.It should be rescaled to values bwteen [-1,1]. |
Ndg |
degree of polynomial. |
Method |
only polynomial fitting ("Poly") is available now. |
The function will fit polynomial curve to each row of the data. The output returns the MSE of each row (gene/isoform).
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- PipeR(rbind(aa,bb,cc))
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- PipeR(rbind(aa,bb,cc))
Calculate residual of the sliding polynomial regression
PipeShiftCDF(Data, Ndg=3, NChun=4, RdmStart=FALSE)
PipeShiftCDF(Data, Ndg=3, NChun=4, RdmStart=FALSE)
Data |
gene-by-sample matrix or isoform-by-sample matrix. It should be rescaled to values bwteen [-1,1]. |
Ndg |
degree of polynomial. |
NChun |
number of starting points for polynomial fitting. |
RdmStart |
whether the start points are randomly selected. |
The function will fit sliding polynomial regression (SPR) to each row of the data. For each gene/isoform, SPR fits NChun polynomial curves with different starting points (samples). The samples with smaller order than the start point will be appended to follow the last sample when fitting. So each fitting consider same number of samples. If RdmStart = TRUE, the start points are randomly selected. Otherwise they are evenly sampled along the sample order. The aggregated MSE of a fit (using a specific start point) is defined as the summation of the MSEs of all genes/isoforms considered here. The output returns the MSE of the SPR, which is the largest aggregated MSE across fits using different start points.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- PipeShiftCDF(rbind(aa,bb,cc), NChun=2)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) res <- PipeShiftCDF(rbind(aa,bb,cc), NChun=2)
Run k-medoid algorithm with varying k on similarity matrix
scanK(SimiMatIn, quan=.95,cut=NULL, maxK=NULL,minSize=0, maxSize=200, fixK=NULL, rawscale=FALSE)
scanK(SimiMatIn, quan=.95,cut=NULL, maxK=NULL,minSize=0, maxSize=200, fixK=NULL, rawscale=FALSE)
SimiMatIn |
gene-by-gene similarity matrix |
quan |
only gene pairs with similarity score >= quan th quantile will be considered in the cluster analyses. Default is 0.95. |
cut |
pre-defined cutoff. Gene pairs with similarity score >= cut will be considered in cluster analyses. If cut is defined, quan will be ignored. |
maxK |
max number of clusters to consider (scan). if numbC=NULL, it will be calculated as [number of gene considered]/10. |
minSize , maxSize
|
Only clusters with minSize<= cluster size <= maxSize are reported in output. |
fixK |
if fixK is specified, the k-medoids algorithm will be applied with fixK clusters. |
rawscale |
Recall the input is the similarity matrix (-log10(distance from the sine model)). the k-medoids clustering will be applied using (-Input) as distance. If rawscale is defined as TRUE, the k-medoids clustering will be applied using -10^Input as distance. |
scanK() function runs k-medoid clustering with varying number of clusters (k). The k is varied from 2 to maxK. The input of scanK() function should be a similarity matrix. scanK() function will cluster genes in gene pairs with high similarity score (the threshold can be defined using parameter quan). To select the top genes, the function first calculate the max similarity score for each gene, then select the genes with high max score.
The output object is a list with 4 sublists: membOut: members in each cluster. clusters are sorted by median similarity score within cluster;
MedCor: median similarity score for each cluster;
Mat: input similarity matrix;
filteredMat: similarity matrix, only showing the top genes used in clustering;
Kcluster: cluster indicator of each top gene.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- OscopeSine(Dat) res2 <- scanK(res1$SimiMat, quan=.8, maxK=5)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) tmp <- matrix(sin(rnorm(330)),ncol=11) rownames(tmp) <- paste0("tmp",1:30) Dat <- rbind(aa, bb, cc, tmp) res1 <- OscopeSine(Dat) res2 <- scanK(res1$SimiMat, quan=.8, maxK=5)
Apply sine model on one particular gene vs. other genes
SineFun(DataInSc,i)
SineFun(DataInSc,i)
DataInSc |
a gene-by-sample (isoform-by-sample) matrix indicating the rescaled expression of two genes/isoforms. all values should be bettwen [-1, 1]. |
i |
the gene (isoform) of interest. The function will apply the sine model on gene (isoform) i vs. gene (isoform) j for all j > i. Gene (isoform) i (j) is defined as the gene (isoform )shown in the i (j) th row. i should be smaller than the total number of genes (isoforms). |
Output is a list with two sublists, each shows the optimal phi's (shift) and epsilon's (value). N-i entries will be included in each sublist (N is the total number of genes/isoforms). The kth entry indicates results of gene (isoform) i vs. i+k.
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) SineFun(rbind(aa,bb,cc), 1)
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) cc <- sin(seq(0.9,1.9,.1)) SineFun(rbind(aa,bb,cc), 1)
Function for searching optimal phase shift
SineOptim(Pairdata)
SineOptim(Pairdata)
Pairdata |
a sample-by-2 matrix indicating the rescaled expression of two genes/isoforms. all values should be bettwen [-1, 1]. |
Output provides the optimal phi (shift) and its corresponding epsilon^2 (value) of the sine model. epsilon_g1,g2^2 = sum_s [X_g1,s^2+X^2_g2,s -2X_g1,sX_g2,s cos(phi_g1,g2)-sin^2(phi_g1,g2)]^2
Ning Leng
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) SineOptim(cbind(aa,bb))
aa <- sin(seq(0,1,.1)) bb <- sin(seq(0.5,1.5,.1)) SineOptim(cbind(aa,bb))