Given the methylation sequencing data of a cell-free DNA (cfDNA) sample, for each cancer marker or tissue marker, we deconvolve the tumor-derived or tissue-specific reads from all reads falling in the marker region. Our read-based deconvolution algorithm exploits the pervasiveness of DNA methylation for signal enhancement, therefore can sensitively identify a trace amount of tumor-specific or tissue-specific cfDNA in plasma.
Specifically, cfTools can deconvolve different sources of cfDNA fragments (or reads) in two contexts:
Cancer detection [1]: separate cfDNA fragments into tumor-derived fragments and background normal fragments (2 classes), and estimate the tumor-derived cfDNA fraction θ (0 ≤ θ < 1).
Tissue deconvolution [2,3]: separate cfDNA fragments from different tissues (> 2 classes), and estimate the cfDNA fraction θt (0 ≤ θt < 1) of different tissue types (including an unknown type) t for a plasma cfDNA sample.
These functions can serve as foundations for more advanced cfDNA-based studies, including cancer diagnosis and disease monitoring.
cfTools
is an R
package available via the Bioconductor repository for packages.
You can install the release version by using the following commands in
your R
session:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("cfTools")
Alternatively, you can install the development version from GitHub :
The two main input files for CancerDetector()
and
cfDeconvolve()
are
Input 1: fragment-level methylation states (methState), which are represented by a sequence of binary values (0 represents unmethylated CpG and 1 represents methylated CpG on the same fragment);
Input 2: methylation pattern (paired shape parameters of beta distributions) of markers.
cfSort()
mainly takes Input 1 as the
only input file.
Section 3.1, 3.2, 3.3 provide an example for generating Input 1. We require users to provide pre-processed paired-end bisulfite sequencing reads (i.e., aligned to the reference genome). For each cfDNA sample, users need to prepare (1) the standard (sorted) BED file of the aligned reads and (2) the methylation information that bismark extracts from the aligned reads as input data files. Specifically,
MergePEReads()
MergeCpGs()
CpG_OT*
file and a CpG_OB*
file
generated by bismark methylation extractorGenerateFragMeth()
MergePEReads()
and
MergeCpGs()
Section 3.4 provides an example for generating Input 2. Specifically,
GenerateMarkerParam()
Example input data files are included within the package:
Function MergePEReads()
generates fragment-level
information for paired-end sequencing reads. The main input file is a
standard (sorted) BED file (e.g. output of bedtools bamtobed)
of paired-end sequencing reads containing columns of chromosome name,
chromosome start, chromosome end, sequence ID, mapping quality score,
and strand.
PEReads <- file.path(demo.dir, "demo.sorted.bed.txt.gz")
head(read.table(PEReads, header=FALSE), 2)
#> V1 V2 V3
#> 1 chr21 9417768 9417888
#> 2 chr21 9418101 9418223
#> V4
#> 1 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT/2
#> 2 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT/1
#> V5 V6
#> 1 0 +
#> 2 0 -
The output is a list in BED file format and/or written to an output BED file, where each line contains the information of a cfDNA fragment.
fragInfo <- MergePEReads(PEReads)
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda create --yes --prefix /github/home/.cache/R/basilisk/1.19.0/cfTools/1.7.0/new_env 'python=3.8' --quiet -c conda-forge --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/cfTools/1.7.0/new_env 'python=3.8' -c conda-forge --override-channels
#> + /github/home/.cache/R/basilisk/1.19.0/0/bin/conda install --yes --prefix /github/home/.cache/R/basilisk/1.19.0/cfTools/1.7.0/new_env -c conda-forge 'python=3.8' 'python=3.8' 'numpy=1.23.4' 'scipy=1.8.0' 'tensorflow=2.10.0' --override-channels
head(fragInfo, 2)
#> chr start end fragmentLength strand
#> 1 chr21 9439599 9439741 142 -
#> 2 chr21 9483455 9483622 167 -
#> name
#> 1 GWNJ-0965:805:GW2108234048th:1:1101:11972:44486_1:N:0:ATTCCT:R1:TTACGGCG:R2:CAGTGGAG:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1101:20933:67955_1:N:0:ATTCCT:R1:GGGACTTT:R2:CGCGAGTA:F1:TGACT:F2:TGACT
Function MergeCpGs()
generates fragment-level
methylation states of CpGs. The main inputs of it are two output files
of bismark methylation extractor, which is a program performing
methylation calling on bisulfite treated sequencing reads. The
CpG_OT*
file contains methylation information for CpGs on
the original top strand (OT); the CpG_OB*
file contains
methylation information for CpGs on the original bottom strand (OB).
Both files contain columns of sequence ID, methylation state, chromosome
name, chromosome start, methylation call.
CpG_OT <- file.path(demo.dir, "CpG_OT_demo.txt.gz")
CpG_OB <- file.path(demo.dir, "CpG_OB_demo.txt.gz")
head(read.table(CpG_OT, header=FALSE), 2)
#> V1
#> 1 GWNJ-0965:805:GW2108234048th:2:2108:26250:45136_1:N:0:ATTCCT:R1:CTTTCCAA:R2:GCTTCGAG:F1:TGACT:F2:AGACG
#> 2 GWNJ-0965:805:GW2108234048th:2:2108:26250:45136_1:N:0:ATTCCT:R1:CTTTCCAA:R2:GCTTCGAG:F1:TGACT:F2:AGACG
#> V2 V3 V4 V5
#> 1 + chr21 9438974 Z
#> 2 - chr21 9438990 z
head(read.table(CpG_OB, header=FALSE), 2)
#> V1
#> 1 GWNJ-0965:805:GW2108234048th:2:1114:27600:18502_1:N:0:ATTCCT:R1:GAAAATGA:R2:AAGATGCC:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:2:1114:27600:18502_1:N:0:ATTCCT:R1:GAAAATGA:R2:AAGATGCC:F1:TGACT:F2:TGACT
#> V2 V3 V4 V5
#> 1 + chr21 9498035 Z
#> 2 + chr21 9497961 Z
The output is a list in BED file format and/or written to an output BED file, where each line contains methylation states of all CpGs on the same fragment. Column methState is a sequence of binary values indicating the methylation states of all CpGs on the same fragment (0 represents unmethylated CpG and 1 represents methylated CpG).
methInfo <- MergeCpGs(CpG_OT, CpG_OB)
head(methInfo, 2)
#> chr cpgStart cpgEnd strand cpgNumber
#> 1 chr21 9439600 9439741 - 7
#> 2 chr21 9483487 9483622 - 16
#> cpgPosition
#> 1 9439600,9439636,9439678,9439685,9439717,9439725,9439741
#> 2 9483487,9483489,9483516,9483519,9483522,9483524,9483536,9483539,9483541,9483574,9483577,9483580,9483582,9483600,9483603,9483622
#> methState
#> 1 0111010
#> 2 0111011010011010
#> name
#> 1 GWNJ-0965:805:GW2108234048th:1:1101:11972:44486_1:N:0:ATTCCT:R1:TTACGGCG:R2:CAGTGGAG:F1:TGACT:F2:TGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1101:20933:67955_1:N:0:ATTCCT:R1:GGGACTTT:R2:CGCGAGTA:F1:TGACT:F2:TGACT
Function GenerateFragMeth()
combines the output lists of
MergePEReads()
and MergeCpGs()
into one list,
which contains both the fragment information and the methylation states
of all CpGs on each fragment.
fragMeth <- GenerateFragMeth(fragInfo, methInfo)
head(fragMeth, 2)
#> chr start end
#> 1 chr21 9417768 9418223
#> 2 chr21 9431278 9431614
#> name
#> 1 GWNJ-0965:805:GW2108234048th:1:2211:31223:60642_1:N:0:ATTCCT:R1:CAAGTCCC:R2:ACATTATG:F1:TGACT:F2:AGACT
#> 2 GWNJ-0965:805:GW2108234048th:1:1106:4411:3542_1:N:0:ATTCCT:R1:CATGTGCC:R2:AGCACTAA:F1:TGACT:F2:ACACT
#> fragmentLength strand cpgNumber cpgPosition
#> 1 455 - 2 9418124,9418171
#> 2 336 + 5 9431279,9431386,9431537,9431554,9431596
#> methState
#> 1 01
#> 2 00111
Function GenerateMarkerParam()
calculates paired shape
parameters of beta distributions for each marker. There are three main
inputs to this function: (1) a list of methylation levels (e.g., beta
values), where each row is a sample and each column is a marker; (2) a
vector of sample types (e.g., tumor or normal, tissue types)
corresponding to the rows of the list; (3) a vector of marker names
corresponding to the columns of the list.
methLevel <- read.table(file.path(demo.dir, "beta_matrix.txt.gz"),
row.names=1, header = TRUE)
sampleTypes <- read.table(file.path(demo.dir, "sample_type.txt.gz"),
row.names=1, header = TRUE)$sampleType
markerNames <- read.table(file.path(demo.dir, "marker_index.txt.gz"),
row.names=1, header = TRUE)$markerIndex
print(methLevel)
#> marker1 marker2 marker3
#> sample1 0.8967302 0.9444779 0.9274943
#> sample2 0.6807593 0.9391826 0.8329514
#> sample3 0.8407675 0.9312582 0.8824475
#> sample4 0.8614165 0.9571972 0.8170904
#> sample5 0.8685513 0.9138511 0.8993120
#> sample6 0.5926250 0.9626752 0.8629679
#> sample7 0.8382709 0.9145921 0.2179631
#> sample8 0.8554025 0.9510139 0.8368095
#> sample9 0.8286787 0.9376723 0.8830746
#> sample10 0.8362881 0.9286860 0.8326099
#> sample11 0.8782031 0.9801491 0.9095433
#> sample12 0.6815395 0.9093327 0.8749625
#> sample13 0.8751985 0.9527680 0.8362279
#> sample14 0.7974884 0.9184259 0.9163573
#> sample15 0.8785107 0.8995657 0.8645149
#> sample16 0.8159900 0.9719771 0.8779748
#> sample17 0.4364249 0.8534525 0.3489145
#> sample18 0.8701227 0.9525218 0.8704187
#> sample19 0.8737151 0.9037688 0.8957254
#> sample20 0.8572883 0.9467923 0.8380768
print(sampleTypes)
#> [1] "normal" "tumor" "tumor" "normal" "normal" "tumor" "tumor" "normal"
#> [9] "normal" "tumor" "normal" "tumor" "normal" "tumor" "tumor" "normal"
#> [17] "tumor" "normal" "tumor" "normal"
print(markerNames)
#> [1] 1 2 3
The output is a list containing the paired shape parameters of beta
distributions for markers, which are delimited by :
. Users
can save this list into a file with column names, no row names, and
columns are delimited by TAB for later use.
markerParam <- GenerateMarkerParam(methLevel, sampleTypes, markerNames)
print(markerParam)
#> markerName normal tumor
#> 1 1 183.74:29.723 5.955:2.032
#> 2 2 134.584:6.958 83.653:7.662
#> 3 3 72.949:10.939 1.476:0.484
Note that parameter value NA:NA
or 0:0
may
cause errors in the following analyses. Remove these values before using
this file.
To make the computation more efficient, users may only keep the
fragments that overlap with the genomic regions of markers. Here, we
provide an example of using R
package GenomicRanges
to perform the intersection.
First, transform the two BED files into GRanges classes.
library(GenomicRanges)
# a BED file of fragment-level methylation information
frag_bed <- read.csv(file.path(demo.dir, "demo.fragment_level.meth.bed.txt.gz"),
header=TRUE, sep="\t")
frag_meth.gr <- GRanges(seqnames=frag_bed$chr,
ranges=IRanges(frag_bed$start, frag_bed$end),
strand=frag_bed$strand,
methState=as.character(frag_bed$methState))
# a BED file of genomic regions of markers
markers_bed <- read.csv(file.path(demo.dir, "markers.bed.txt.gz"),
header=TRUE, sep="\t")
markers.gr <- GRanges(seqnames=markers_bed$chr,
ranges=IRanges(markers_bed$start, markers_bed$end),
markerName=markers_bed$markerName)
head(frag_meth.gr, 2)
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | methState
#> <Rle> <IRanges> <Rle> | <character>
#> [1] chr21 9417768-9418223 - | 1
#> [2] chr21 9431278-9431614 + | 111
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
head(markers.gr, 2)
#> GRanges object with 2 ranges and 1 metadata column:
#> seqnames ranges strand | markerName
#> <Rle> <IRanges> <Rle> | <integer>
#> [1] chr21 9418124-9418171 * | 1
#> [2] chr21 9437462-9437533 * | 2
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Then, overlap two GRanges classes and get the fragment-level methylation states intersecting with the markers.
ranges <- subsetByOverlaps(frag_meth.gr, markers.gr, ignore.strand=TRUE)
hits <- findOverlaps(frag_meth.gr, markers.gr,ignore.strand=TRUE)
idx <- subjectHits(hits)
values <- DataFrame(markerName=markers.gr$markerName[idx])
mcols(ranges) <- c(mcols(ranges), values)
marker.methState <- as.data.frame(cbind(ranges$markerName,
ranges$methState))
colnames(marker.methState) <- c("markerName", "methState")
head(marker.methState, 4)
#> markerName methState
#> 1 1 1
#> 2 2 10111101110
#> 3 2 101100011010
#> 4 2 1011000111001
CancerDetector()
Function CancerDetector()
separates cfDNA into
tumor-derived fragments and background normal fragments and estimates
the tumor burden. The main inputs are two files: (1) the fragment-level
methylation states of reads (column methState
) that mapped
to the cancer-specific markers; (2) paired shape parameters of beta
distributions for cancer-specific markers. All columns are delimited by
TAB, and the first line is the column names. In addition, users can tune
the parameter lambda
to adjust the relative level of tumor
burden.
fragMethFile <- file.path(demo.dir, "CancerDetector.reads.txt.gz")
markerParamFile <- file.path(demo.dir, "CancerDetector.markers.txt.gz")
head(read.csv(fragMethFile, sep = "\t", colClasses = "character"), 4)
#> markerName methState
#> 1 2 000
#> 2 2 000
#> 3 2 000
#> 4 2 111111
head(read.csv(markerParamFile, sep = "\t"), 4)
#> markerName tumor normalPlasma
#> 1 2 5:3 97:12
#> 2 3 3:2 57:9
#> 3 4 4:4 24:3
#> 4 5 3:3 32:5
The output is the estimated tumor burden θ and the normal cfDNA fraction 1 − θ.
cfDeconvolve()
Function cfDeconvolve()
estimates fractions of cfDNA
fragments from different tissues (> 2 classes). The main two input
files are similar to function CancerDetector()
: (1) the
fragment-level methylation states of reads (column
methState
) that mapped to the tissue-specific markers; (2)
paired shape parameters of beta distributions for tissue-specific
markers. All columns are delimited by TAB, and there is a header line of
the column names.
fragMethFile2 <- file.path(demo.dir, "cfDeconvolve.reads.txt.gz")
markerParamFile2 <- file.path(demo.dir, "cfDeconvolve.markers.txt.gz")
head(read.csv(fragMethFile2, header=TRUE, sep="\t",
colClasses = "character"), 4)
#> markerName methState
#> 1 2 1111100
#> 2 2 0100110
#> 3 2 0111111
#> 4 2 1111111
head(read.csv(markerParamFile2, header=TRUE, sep="\t",
colClasses = "character"), 4)
#> markerName tissue1 tissue2 tissue3 tissue4 tissue5 tissue6
#> 1 2 1.68:1.82 2:18 1.02:0.459 2.08:18 0.061 0.176
#> 2 27 0.605 0.47 0.667 29.1:12.1 0.0528 0.542
#> 3 61 12.8:2.68 9.35:1.75 19.4:4.06 6.28:7.06 8.44:20.4 21.1:3.48
#> 4 63 4.71:2.49 13.5:2.17 55.3:8.53 0.906 31.3:3.8 5.46:3.17
#> tissue7
#> 1 6.65:2.19
#> 2 7.31:3.47
#> 3 6.69:4.01
#> 4 15:3.3
Other input parameters are:
em.global.unknown
(default),
em.global.known
, em.local.unknown
,
em.local.known
.For example:
numTissues <- 7
emAlgorithmType <- "em.global.unknown"
likelihoodRatioThreshold <- 2
emMaxIterations <- 100
randomSeed <- 0
id <- "test"
The output is a list containing the cfDNA fractions of different tissue types and an unknown class.
tissueFraction <- cfDeconvolve(fragMethFile2, markerParamFile2, numTissues,
emAlgorithmType, likelihoodRatioThreshold,
emMaxIterations, randomSeed, id)
tissueFraction
#> tissue1 tissue2 tissue3 tissue4 tissue5 tissue6 tissue7
#> 1 1.58246e-13 1.35298e-19 0.0442296 1.45249e-21 0.0157226 4.00821e-18 0.930494
#> unknown
#> 1 0.00955414
cfSort()
Function cfSort()
estimates fractions of cfDNA fragments
derived from 29 major human tissues. It is the first supervised tissue
deconvolution approach with deep learning models. The main input file is
similar to function CancerDetector()
and
cfDeconvolve()
: the fragment-level methylation states of
reads (column methState
) that mapped to the tissue-specific
markers. The first column is the marker name of cfSort markers.
fragMethInfo <- file.path(demo.dir, "cfsort_reads.txt.gz")
head(read.csv(fragMethInfo, header=TRUE, sep="\t",
colClasses = "character"), 4)
#> markerName methState
#> 1 2 1110101
#> 2 2 1111110
#> 3 2 1111101
#> 4 2 1111111
The output is a list containing the cfDNA fractions of 29 tissue types.
tissueFraction2 <- cfSort(fragMethInfo, id="demo")
tissueFraction2
#> adipose_tissue adrenal_gland bladder blood_vessel breast cervix_uteri colon
#> 1 0 0 0 0 0 0 0
#> esophagus fallopian_tube heart kidney liver lung muscle nerve ovary pancreas
#> 1 0 0 0 0 0 0 0 0 0 0
#> pituitary prostate salivary_gland skin small_intestine spleen stomach testis
#> 1 0 0 0 0 0 0 0 0
#> thyroid uterus vagina WBC
#> 1 0 0 0 1
[1] Li W, Li Q, Kang S, Same M, Zhou Y, Sun C, Liu CC, Matsuoka L, Sher L, Wong WH, Alber F, Zhou XJ. CancerDetector: ultrasensitive and non-invasive cancer detection at the resolution of individual reads using cell-free DNA methylation sequencing data. Nucleic Acids Res. 2018 Sep 6;46(15):e89. doi: 10.1093/nar/gky423. PMID: 29897492; PMCID: PMC6125664.
[2] Del Vecchio G, Li Q, Li W, Thamotharan S, Tosevska A, Morselli M, Sung K, Janzen C, Zhou X, Pellegrini M, Devaskar SU. Cell-free DNA methylation and transcriptomic signature prediction of pregnancies with adverse outcomes. Epigenetics. 2021 Jun;16(6):642-661. doi: 10.1080/15592294.2020.1816774. PMID: 33045922; PMCID: PMC8143248.
[3] Li S, Zeng W, Ni X, Liu Q, Li W, Stackpole ML, Zhou Y, Gower A, Krysan K, Ahuja P, Lu DS, Raman SS, Hsu W, Aberle DR, Magyar CE, French SW, Han SB, Garon EB, Agopian VG, Wong WH, Dubinett SM, Zhou XJ. Comprehensive tissue deconvolution of cell-free DNA by deep learning for disease diagnosis and monitoring. Proc Natl Acad Sci U S A. 2023 Jul 11;120(28):e2305236120. doi: 10.1073/pnas.2305236120. Epub 2023 Jul 3. PMID: 37399400.
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] cfToolsData_1.4.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [4] IRanges_2.41.2 S4Vectors_0.45.2 BiocGenerics_0.53.3
#> [7] generics_0.1.3 cfTools_1.7.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] KEGGREST_1.47.0 dir.expiry_1.15.0 xfun_0.50
#> [4] bslib_0.8.0 Biobase_2.67.0 lattice_0.22-6
#> [7] vctrs_0.6.5 tools_4.4.2 curl_6.1.0
#> [10] parallel_4.4.2 AnnotationDbi_1.69.0 tibble_3.2.1
#> [13] RSQLite_2.3.9 blob_1.2.4 pkgconfig_2.0.3
#> [16] R.oo_1.27.0 Matrix_1.7-1 dbplyr_2.5.0
#> [19] lifecycle_1.0.4 GenomeInfoDbData_1.2.13 compiler_4.4.2
#> [22] Biostrings_2.75.3 htmltools_0.5.8.1 sys_3.4.3
#> [25] buildtools_1.0.0 sass_0.4.9 yaml_2.3.10
#> [28] crayon_1.5.3 pillar_1.10.1 jquerylib_0.1.4
#> [31] R.utils_2.12.3 cachem_1.1.0 mime_0.12
#> [34] ExperimentHub_2.15.0 AnnotationHub_3.15.0 basilisk_1.19.0
#> [37] tidyselect_1.2.1 digest_0.6.37 purrr_1.0.2
#> [40] dplyr_1.1.4 BiocVersion_3.21.1 maketools_1.3.1
#> [43] fastmap_1.2.0 grid_4.4.2 cli_3.6.3
#> [46] magrittr_2.0.3 withr_3.0.2 filelock_1.0.3
#> [49] UCSC.utils_1.3.1 rappdirs_0.3.3 bit64_4.6.0-1
#> [52] rmarkdown_2.29 XVector_0.47.2 httr_1.4.7
#> [55] bit_4.5.0.1 reticulate_1.40.0 png_0.1-8
#> [58] R.methodsS3_1.8.2 memoise_2.0.1 evaluate_1.0.3
#> [61] knitr_1.49 basilisk.utils_1.19.0 BiocFileCache_2.15.0
#> [64] rlang_1.1.4 Rcpp_1.0.14 glue_1.8.0
#> [67] DBI_1.2.3 BiocManager_1.30.25 jsonlite_1.8.9
#> [70] R6_2.5.1