Title: | cfDNAPro extracts and Visualises biological features from whole genome sequencing data of cell-free DNA |
---|---|
Description: | cfDNA fragments carry important features for building cancer sample classification ML models, such as fragment size and fragment end motif. Analyzing and visualizing fragment size metrics, as well as other biological features, in a curated, standardized, scalable, well-documented, and reproducible way can be time-intensive. This package aims to solve these problems and simplify the process. It offers two sets of functions: one for cfDNA feature characterization and one for visualization. |
Authors: | Haichao Wang [aut, cre], Hui Zhao [ctb], Elkie Chan [ctb], Christopher Smith [ctb], Tomer Kaplan [ctb], Florian Markowetz [ctb], Nitzan Rosenfeld [ctb] |
Maintainer: | Haichao Wang <[email protected]> |
License: | GPL-3 |
Version: | 1.13.0 |
Built: | 2024-10-30 04:39:19 UTC |
Source: | https://github.com/bioc/cfDNAPro |
Calculate the metrics of insert size
callMetrics( path = getwd(), groups, fun = "all", outfmt = "df", input_type, ... )
path |
The root folder containing all groups folders, default is the present working folder. |
groups |
The names of the groups, given as a vector, e.g. groups=c('group1','group2'). Default is all sub-folders in 'path'. |
fun |
String value, the types of metrics to be calculated. Default is 'all', which means both median and mean values will be returned. |
outfmt |
The output format: 'list', 'dataframe' or 'df'. Default is dataframe. |
input_type |
Character. The input file format, one of 'picard', 'bam' or 'cfdnapro'. BAM files must have duplicates marked. |
... |
Further arguments passed to or from other methods. |
The insert size metrics in list or dataframe format.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the metrics.
df <- callMetrics(path = path)
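To make the "median and mean" metrics concrete, the following base-R sketch shows the general idea: for each group and insert size, summarise a per-sample value across samples. The toy table and its column names (`group`, `sample`, `insert_size`, `prop`) are assumptions for illustration only. This is not cfDNAPro's internal code, and the columns returned by `callMetrics` may be named differently.

```r
# Toy long-format table (hypothetical column names): three samples in one
# group, each with a fragment-size proportion at three insert sizes.
toy <- data.frame(
  group       = "cohort_1",
  sample      = rep(c("s1", "s2", "s3"), each = 3),
  insert_size = rep(c(166, 167, 168), times = 3),
  prop        = c(0.2, 0.5, 0.3,
                  0.3, 0.4, 0.3,
                  0.1, 0.6, 0.3)
)

# Mean and median of the proportion across samples, per group and insert size.
mean_prop   <- aggregate(prop ~ group + insert_size, data = toy, FUN = mean)
median_prop <- aggregate(prop ~ group + insert_size, data = toy, FUN = median)

# Combine both summaries into one table.
metrics <- merge(mean_prop, median_prop,
                 by = c("group", "insert_size"),
                 suffixes = c("_mean", "_median"))
print(metrics)
```

The same pattern would apply to cumulative values (cdf, 1-cdf) if they were present as columns.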
Calculate the mode fragment size of each sample
callMode( path, groups, outfmt = "df", order = groups, summary, mincount, input_type, ... )
path |
The root folder containing all groups folders, default is the present working folder. |
groups |
The names of the groups, given as a vector, e.g. groups=c('group1','group2'). Default is all folders in the folder path. |
outfmt |
The output format: 'list', 'dataframe' or 'df'. Default is dataframe. |
order |
The order of the groups in the sorted output. Default is the value of the 'groups' parameter. |
summary |
Summarize the dataframe result by calculating each mode size and its count. Default value is FALSE. |
mincount |
Minimum count of each mode size in the summarized output. Only takes effect when 'summary = TRUE'. |
input_type |
Character. The input file format, one of 'picard', 'bam' or 'cfdnapro'. BAM files must have duplicates marked. |
... |
Further arguments passed to or from other methods. |
The function returns the mode fragment size of each sample in list or dataframe format.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the mode.
df <- callMode(path = path)
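As an illustration of what a "mode fragment size" is, here is a minimal base-R sketch that picks, for each sample, the insert size with the highest count and then tallies how many samples share each mode. The toy table and its column names are hypothetical. The sketch only illustrates the concept and is not the package's implementation.

```r
# Toy fragment counts for two samples (hypothetical column names).
toy <- data.frame(
  sample      = rep(c("s1", "s2"), each = 4),
  insert_size = rep(c(165, 166, 167, 168), times = 2),
  count       = c(10, 30, 55, 20,
                  12, 60, 40, 18)
)

# The mode fragment size is the insert size with the highest count.
mode_size <- sapply(split(toy, toy$sample), function(d) {
  d$insert_size[which.max(d$count)]
})
print(mode_size)

# Optional summary: how many samples share each mode size.
mode_summary <- as.data.frame(table(mode = mode_size), stringsAsFactors = FALSE)
print(mode_summary)
```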
Calculate the inter-peak distance of insert size
callPeakDistance( path = getwd(), groups, limit, outfmt, summary, mincount, input_type, ... )
path |
The root folder containing all groups folders. Default is the present working folder. |
groups |
The name of the groups, the input value should be vector, e.g. groups=c('group1','group2'). Default is all folders in the folder path. |
limit |
The insert size range that will be focused on. Default value is 'limit = c(35,135)'. |
outfmt |
The output format, a 'list' or 'dataframe'. Default is dataframe. |
summary |
If TRUE, summarize the output. |
mincount |
The minimum count value of inter-peak distance in the summary. |
input_type |
Character. The input file format, one of 'picard', 'bam' or 'cfdnapro'. BAM files must have duplicates marked. |
... |
Further arguments passed to or from other methods. |
The function returns the inter peak distance in list or dataframe format.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the inter-peak distance.
df <- callPeakDistance(path = path)
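The notion of an inter-peak distance can be sketched in base R as follows: restrict the profile to the insert-size window given by `limit`, locate local maxima, and take the differences between consecutive peak positions. The peak finder below (a sign change in the first difference) is a simple stand-in chosen for illustration; the documentation does not state which peak-detection method the package uses, and the counts are made up.

```r
# Toy insert-size profile with a ~10 bp periodicity (values are made up).
size  <- 40:69
count <- rep(c(5, 8, 12, 8, 5, 3, 2, 1, 2, 3), times = 3)

# Restrict to the insert-size window of interest.
limit <- c(35, 135)
keep  <- size >= limit[1] & size <= limit[2]
size  <- size[keep]
count <- count[keep]

# A local maximum is a point where the slope changes from positive to negative.
slope_change <- diff(sign(diff(count)))
peaks <- size[which(slope_change == -2) + 1]

# Inter-peak distance: difference between consecutive peak positions.
interpeak_dist <- diff(peaks)
print(peaks)
print(interpeak_dist)
```

Replacing `== -2` with `== 2` finds local minima instead, which is the analogous idea behind an inter-valley distance.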
Calculate the insert size metrics (i.e. prop, cdf, 1-cdf) of each group
callSize(path, groups, outfmt, input_type, ...)
path |
The root folder containing all groups folders, default is the present working folder. |
groups |
The names of the groups, given as a vector, e.g. groups=c('group1','group2'). Default is all folders in the folder path. |
outfmt |
The output format: 'list', 'dataframe' or 'df'. Default is dataframe. |
input_type |
Character. The input file format, one of 'picard', 'bam' or 'cfdnapro'. BAM files must have duplicates marked. |
... |
Further arguments passed to or from other methods. |
The function returns the insert size metrics of each group in list or dataframe format.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the size.
df <- callSize(path = path)
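The three size metrics named above (prop, cdf, 1-cdf) can be derived from raw fragment counts as in this short base-R sketch. The toy counts and the column names are assumptions made for illustration; the columns produced by `callSize` may be named differently.

```r
# Toy fragment counts for one sample (values are made up).
toy <- data.frame(
  insert_size = c(100, 150, 167, 200),
  count       = c(10, 20, 50, 20)
)

# Proportion of fragments at each insert size.
toy$prop <- toy$count / sum(toy$count)

# Cumulative distribution: fraction of fragments at or below each size.
toy$cdf <- cumsum(toy$prop)

# Complement: fraction of fragments longer than each size.
toy$one_minus_cdf <- 1 - toy$cdf

print(toy)
```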
Calculate the inter-valley distance of insert size
callValleyDistance( path = getwd(), groups, limit, outfmt, summary, mincount, input_type, ... )
path |
The root folder containing all groups folders, default is the present working folder. |
groups |
The name of the groups, the input value should be vector, e.g. groups = c('group1','group2'), default is all folders in the folder path. |
limit |
The insert size range that will be focused on, default value is 'limit = c(35,135)'. |
outfmt |
The output format: 'list', 'dataframe' or 'df'. Default is dataframe. |
summary |
If TRUE, summarize the output. |
mincount |
The minimum count value of inter-valley distance. |
input_type |
Character. The input file format, one of 'picard', 'bam' or 'cfdnapro'. BAM files must have duplicates marked. |
... |
Further arguments passed to or from other methods. |
The inter-valley distance in a list or dataframe.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the inter-valley distance.
df <- callValleyDistance(path = path)
The cfDNAPro package ships sample files in its 'inst/extdata' directory. This function returns the path to that data.
examplePath(data = NULL)
data |
Name of the data set, such as "groups_picard" or "step6". If 'NULL', the path of the extdata folder is returned. |
A string (i.e. the path).
examplePath()
examplePath("groups_picard")
examplePath("step6")
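Helpers of this kind are commonly thin wrappers around base R's `system.file()`. The sketch below shows that pattern under the assumption that the data live in the package's installed `extdata` folder. The function name `example_path_sketch` is hypothetical, and the sketch is not taken from the cfDNAPro source.

```r
# Hypothetical helper mimicking the documented behaviour with system.file().
# system.file() returns "" when the package or file cannot be found.
example_path_sketch <- function(data = NULL, package = "cfDNAPro") {
  if (is.null(data)) {
    system.file("extdata", package = package)
  } else {
    system.file("extdata", data, package = package)
  }
}

# With cfDNAPro installed, this should point at a folder of example data;
# otherwise the result is an empty string.
p <- example_path_sketch("groups_picard")
if (nzchar(p)) {
  print(list.files(p))
} else {
  message("cfDNAPro example data not found")
}
```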
Plot the raw fragment size metrics (e.g. proportion, cdf and 1-cdf) of all groups with different colors in a single plot
plotAllToOne(x, order, plot, vline, xlim, ylim, ...)
x |
A long-format dataframe containing the metrics of different cohorts. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c('group1','group2'). Default is all folders in the folder path. |
plot |
The plot type, default is 'all' which means all of proportion, cdf and 1-cdf plots will be shown. |
vline |
Vertical dashed lines, default value is 'c(81,167)'. |
xlim |
The x axis range shown in the plot. Default is 'c(0,500)'. |
ylim |
The y axis range shown in the fraction of fragment size plots. Default is 'c(0,0.035)'. |
... |
Further arguments passed to or from other methods. |
The function returns a list of plots.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the sizes.
df <- callSize(path = path)
# Plot all samples from multiple groups into one figure.
plot <- plotAllToOne(df)
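To show how the `vline`, `xlim` and `ylim` defaults described above shape such a figure, here is a small sketch using base graphics on synthetic data. Base graphics is used only to keep the example dependency-free; it is not how the package draws its plots, and the two curves and their group names are invented.

```r
# Synthetic fragment-size proportions for two made-up groups.
size   <- 1:500
prop_a <- dnorm(size, mean = 167, sd = 20)
prop_b <- dnorm(size, mean = 150, sd = 25)
prop_a <- prop_a / sum(prop_a)
prop_b <- prop_b / sum(prop_b)

# Draw to a null device so the sketch writes no file; remove this line
# (and the dev.off() call) to see the plot interactively.
pdf(NULL)
plot(size, prop_a, type = "l", col = "steelblue",
     xlim = c(0, 500), ylim = c(0, 0.035),
     xlab = "Fragment size (bp)", ylab = "Proportion")
lines(size, prop_b, col = "firebrick")

# Vertical dashed reference lines, matching the documented default positions.
abline(v = c(81, 167), lty = 2)
legend("topright", legend = c("group_a", "group_b"),
       col = c("steelblue", "firebrick"), lty = 1)
invisible(dev.off())
```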
Plot the fragment size metrics (i.e. proportion, cdf and 1-cdf)
plotMetrics(x, order, plot, vline, xlim, ylim, ...)
x |
A long-format dataframe containing the metrics of different cohorts. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c('group1','group2'). Default is all folders in the folder path. |
plot |
The plot type. Default is 'all': both median and mean metrics will be shown, which include mean_prop, mean_cdf, mean_1-cdf, median_prop, median_cdf and median_1-cdf. Can also be set to "median" or "mean". |
vline |
Vertical dashed lines, default value is c(81,167). |
xlim |
The x axis range shown in the plot. Default is c(0,500). |
ylim |
The y axis range shown in the fraction of fragment size plots. Default is c(0,0.0125). |
... |
Further arguments passed to or from other methods. |
The function returns a list of plots.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the metrics.
df <- callMetrics(path = path)
# Plot metrics.
plot <- plotMetrics(df,
  plot = "median",
  order = c("cohort_1", "cohort_2")
)
Plot mode fragment size
plotMode(x, order, type, mincount, hline, ...)
x |
A long-format dataframe containing the mode fragment size; for a template, see the result of the 'callMode' function. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c("group1","group2"). Default is all folders in the folder path. |
type |
The plot type, either "bin" or "stacked". Default is the bin plot. |
mincount |
Minimum count of mode fragment size that will be included. Count number smaller than this value will be removed first, then proportion of each count value will be calculated. Default value is 0. |
hline |
The horizontal lines added to the bin plot. Default lines will be 'c(81,112,170)'. |
... |
Further arguments passed to or from other methods. |
The function returns the plot.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the modes.
df <- callMode(path = path)
# Plot modes.
plot <- plotMode(df, hline = c(80, 111, 170))
Summarize and plot mode fragment size in a stacked bar chart
plotModeSummary(x, order, summarized, mode_partition, ...)
x |
A long-format dataframe containing the mode fragment size; for a template, see the result of the 'callMode' function. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c('group1','group2'). Default is all folders in the folder path. |
summarized |
Logical value, default is FALSE. |
mode_partition |
This should be a list. This decides how the modes are partitioned in each stacked bar. Default value is 'list(c(80, 81), c(111, 112), c(167))'. Also this function will automatically calculate an 'Others' group which includes the modes not mentioned by users. |
... |
Further arguments passed to or from other methods. |
The function returns the plot.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the modes.
df <- callMode(path = path)
# Plot mode summary.
plot <- plotModeSummary(df,
  mode_partition = list(c(80, 81), c(111, 112), c(167))
)
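The effect of `mode_partition` can be illustrated with a base-R sketch that assigns each mode to the partition containing it and puts everything else into "Others". The mode values, the helper `label_mode` and the "80/81"-style labels are all hypothetical. The sketch shows the partitioning idea only, not the package's code or its actual category labels.

```r
# Hypothetical mode fragment sizes, one per sample.
modes <- c(80, 81, 111, 167, 167, 134, 112, 166)

# Partition as in the documented default; each element becomes one category.
mode_partition <- list(c(80, 81), c(111, 112), c(167))

# Label each mode by the partition it falls in; anything else is "Others".
label_mode <- function(m, partition) {
  for (p in partition) {
    if (m %in% p) return(paste(p, collapse = "/"))
  }
  "Others"
}
labels <- vapply(modes, label_mode, character(1), partition = mode_partition)

# Proportion of samples per category, the quantity a stacked bar would show.
prop <- prop.table(table(labels))
print(prop)
```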
Plot the inter-peak distance of the fragment size distribution
plotPeakDistance(x, summarized, order, type, mincount, xlim, ...)
x |
A long-format dataframe containing the inter-peak distance; for a template, see the result of the 'callPeakDistance' function. |
summarized |
Logical value, describing whether x is already summarized. Summarized means that the count and proportion of each interpeak_dist have been calculated. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c('group1','group2'). Default is all folders in the folder path. |
type |
The plot type. Only the line plot is currently supported, so keep this parameter at its default in this version. |
mincount |
Minimum count value of inter peak distance, count number less than this value will be removed first, then proportion of each count value will be calculated. Default value is 0. |
xlim |
The x axis range shown in the plot. Default is 'c(8,13)'. |
... |
Further arguments passed to or from other methods. |
The function returns the line plot of inter peak distance.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the inter-peak distance.
df <- callPeakDistance(path = path)
# Plot the inter-peak distance.
plot <- plotPeakDistance(df,
  xlim = c(8, 13),
  mincount = 2
)
Plot the raw fragment size metrics of a single group in a single plot, colored by sample.
plotSingleGroup(x, xlim, ylim, vline, order, plot, ...)
x |
A long-format dataframe containing the metrics of different cohorts. |
xlim |
The x axis range shown in the plot. Default is 'c(0,500)'. |
ylim |
The y axis range shown in the fraction of fragment size plots. Default is 'c(0,0.035)'. |
vline |
Vertical dashed lines, default value is 'c(81,167)'. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c('group1'). Default is all groups/cohorts in the folder path. |
plot |
The plot type. Default is 'all', which means the proportion, cdf and 1-cdf plots will all be shown. |
... |
Further arguments passed to or from other methods. |
The function returns a list of plots.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the metrics.
df <- callMetrics(path = path)
# Plot only the group specified.
plot <- plotSingleGroup(x = df, order = c("cohort_1"))
Plot the inter-valley distance of the fragment size distribution
plotValleyDistance(x, order, type, mincount, xlim, ...)
x |
A long-format dataframe containing the inter-valley distance; for a template, see the result of the 'callValleyDistance' function. |
order |
The groups to show in the final plot, given as a vector, e.g. order = c('group1','group2'). Default is all folders in the folder path. |
type |
The plot type. Only the line plot is currently supported, so keep this parameter at its default in this version. |
mincount |
Minimum count value of inter valley distance, count number less than this value will be removed first, then proportion of each count value will be calculated. Default value is 0. |
xlim |
The x axis range shown in the plot. Default is c(8,13). |
... |
Further arguments passed to or from other methods. |
The function returns the line plot of inter valley distance.
Haichao Wang
# Get the path to example data.
path <- examplePath("groups_picard")
# Calculate the inter-valley distance.
df <- callValleyDistance(path = path)
# Plot the inter-valley distance.
plot <- plotValleyDistance(df,
  xlim = c(8, 13),
  mincount = 2
)
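The `mincount` description above implies a two-step order: rare distances are dropped first, and proportions are then computed among what remains. A minimal base-R sketch of that order, using made-up distances and hypothetical column names, could look like this:

```r
# Hypothetical inter-valley distances pooled from several samples of one group.
intervalley_dist <- c(10, 10, 11, 10, 9, 11, 10, 12)

# Count how often each distance occurs.
counts <- as.data.frame(table(intervalley_dist), stringsAsFactors = FALSE)
names(counts) <- c("intervalley_dist", "count")

# Drop rare distances first, then compute proportions among the rest.
mincount   <- 2
summary_df <- counts[counts$count >= mincount, ]
summary_df$prop <- summary_df$count / sum(summary_df$count)
print(summary_df)
```

Because filtering happens before normalisation, the remaining proportions always sum to 1, so raising `mincount` changes the plotted proportions as well as the number of points.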
Calculate insert sizes from a curated GRanges object
read_bam_insert_metrics( bamfile, chromosome_to_keep = paste0("chr", 1:22), strand_mode = 1, genome_label = "hg19", outdir = NA, isize_min = 1L, isize_max = 1000L, ... )
bamfile |
The bam file name. |
chromosome_to_keep |
Should be a character vector containing the seqnames to be kept in the GRanges object. Default is paste0("chr", 1:22). |
strand_mode |
Usually, strand_mode = 1 means the first read is aligned to the positive strand. For details, see the GenomicAlignments docs. |
genome_label |
The genome used in the alignment. Should be "hg19", "hg38" or "hg38-NCBI". Default is "hg19". Note: "hg19" will load the BSgenome.Hsapiens.UCSC.hg19 package, which contains full genome sequences for Homo sapiens (Human) as provided by UCSC (hg19, based on GRCh37.p13), stored in Biostrings objects; "hg38" will load the BSgenome.Hsapiens.UCSC.hg38 package, which contains full genome sequences for Homo sapiens (Human) as provided by UCSC (hg38, based on GRCh38.p13), stored in Biostrings objects; "hg38-NCBI" will load the BSgenome.Hsapiens.NCBI.GRCh38 package, which contains full genome sequences for Homo sapiens (Human) as provided by NCBI (GRCh38, 2013-12-17), stored in Biostrings objects. |
outdir |
The path for saving rds file. Default is NA, i.e. not saving. |
isize_min |
Minimum fragment length to keep. Default is 1L. |
isize_max |
Maximum fragment length to keep. Default is 1000L. |
... |
Further arguments passed to or from other methods. |
This function returns a dataframe with two columns: "insert_size" and "All_Reads.fr_count".
Haichao Wang
## Not run:
object <- read_bam_insert_metrics(bamfile = "/path/to/bamfile.bam")
## End(Not run)
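The shape of the returned table can be illustrated without a BAM file. The base-R sketch below filters a vector of made-up fragment lengths to the `isize_min`/`isize_max` window and tabulates them into the two documented columns. How the package derives fragment lengths from alignments, and whether it keeps rows with a zero count, is not stated in this documentation, so both are assumptions here.

```r
# Hypothetical fragment lengths (bp) standing in for values read from a BAM file.
fragment_length <- c(166, 167, 167, 145, 1200, 167, 0, 166, 320)

isize_min <- 1L
isize_max <- 1000L

# Keep fragments inside the requested length window.
kept <- fragment_length[fragment_length >= isize_min &
                          fragment_length <= isize_max]

# Count fragments at every length in the window (zero counts included here,
# which is an assumption of this sketch).
sizes  <- isize_min:isize_max
counts <- tabulate(kept, nbins = isize_max)[sizes]

result <- data.frame(
  insert_size        = sizes,
  All_Reads.fr_count = counts
)
print(result[result$All_Reads.fr_count > 0, ])
```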
Read bam file into a curated GRanges object
readBam( bamfile, chromosome_to_keep = paste0("chr", 1:22), strand_mode = 1, genome_label = "hg19", outdir = NA, ... )
bamfile |
The bam file name. |
chromosome_to_keep |
Should be a character vector containing the seqnames to be kept in the GRanges object. Default is paste0("chr", 1:22). |
strand_mode |
Usually, strand_mode = 1 means the first read is aligned to the positive strand. For details, see the GenomicAlignments docs. |
genome_label |
The genome used in the alignment. Should be "hg19", "hg38" or "hg38-NCBI". Default is "hg19". Note: "hg19" will load the BSgenome.Hsapiens.UCSC.hg19 package, which contains full genome sequences for Homo sapiens (Human) as provided by UCSC (hg19, based on GRCh37.p13), stored in Biostrings objects; "hg38" will load the BSgenome.Hsapiens.UCSC.hg38 package, which contains full genome sequences for Homo sapiens (Human) as provided by UCSC (hg38, based on GRCh38.p13), stored in Biostrings objects; "hg38-NCBI" will load the BSgenome.Hsapiens.NCBI.GRCh38 package, which contains full genome sequences for Homo sapiens (Human) as provided by NCBI (GRCh38, 2013-12-17), stored in Biostrings objects. |
outdir |
The path for saving rds file. Default is NA, i.e. not saving. |
... |
Further arguments passed to or from other methods. |
This function returns a curated GRanges object.
Haichao Wang
## Not run:
object <- readBam(bamfile = "/path/to/bamfile.bam",
                  outdir = "./",
                  chromosome_to_keep = c("chr1", "chr2", "chr3"))
## End(Not run)