Tutorial of the SARC R Package: flagging the confidence of CNVs from WES/WGS cohorts.

SARC

Statistical Analysis of Regions with CNVs

This tool was designed to evaluate copy number variations (CNVs) that have been called by CNV detection pipelines from NGS data. CNV detection pipelines applied to WES and (to a lesser extent) WGS data lead to high numbers of false positives.

The SARC package aims to flag high-confidence and low-confidence CNVs from such detection pipelines. This should aid patient diagnostics and clinical work. Uniquely, the SARC package only requires a coverage (cov) file and a cnv file. Both files should ideally be loaded as data.frames.

The cov file can be created from many BAM files from WES/WGS, and the read depth coverage of each sample should be normalised by the library size of that sample prior to processing. The end of this vignette shows, with examples, how to read coverage from BAM files and how to normalise it.

The cnv file is in many ways similar to a standard CNV BED file - it serves as a list of CNVs identified in the samples. Unlike a traditional BED file, the cnv file should contain additional columns such as the CNV value, type of CNV, genes and batch of data. The column names of the cnv file should include: SAMPLE, CHROM, START, END, TYPE and VALUE. The order of the columns does not matter. Additional columns such as TOOL, BATCH and GENE can enhance plotting.
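
As a rough illustration, the output of a detection pipeline can be reshaped into this layout with a few lines of base R. This is only a sketch - the input file name and its column names (pipeline_calls.txt, sample, chrom, start, end, type, copy_number) are hypothetical and will differ between pipelines.

#Hypothetical pipeline output reshaped into the column layout SARC expects
calls <- read.table("pipeline_calls.txt", header = TRUE, stringsAsFactors = FALSE)
cnv <- data.frame(SAMPLE = calls$sample,
                  CHROM  = calls$chrom,
                  START  = as.integer(calls$start),
                  END    = as.integer(calls$end),
                  TYPE   = calls$type,                   #"DEL" or "DUP"
                  VALUE  = as.integer(calls$copy_number),
                  TOOL   = "exomeDepth")                 #optional, aids plotting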

This is not a detection tool - it is a downstream ratification tool for use AFTER a CNV detection pipeline!

#Install from Bioconductor
#BiocManager::install("SARC")

#Load the libraries (knitr and kableExtra are used below to render the example tables)
library(SARC)
library(knitr)
library(kableExtra)

Data set-up

The cnv file should contain generic information at the minimum. Sample name ("SAMPLE"), chromosome ("CHROM"), start site ("START"), end site ("END"), deletion or duplication ("TYPE") and CNV value ("VALUE") are required for the tool to function. Column names should be in all caps and match the names in the test_cnv example below.

START, END and VALUE should be integers!

data("test_cnv")
#For speed just use a few detected CNVs
test_cnv <- test_cnv[1:3,]
head(test_cnv) %>%
  kable %>%
  kable_styling("striped", full_width=FALSE) %>%
  scroll_box(width = "800px", height = "200px")
SAMPLE CHROM START END TYPE VALUE TOOL
SampleA chr1 161487763 161565521 DUP 3 exomeDepth
SampleB chr1 17034387 17090909 DEL 0 exomeDepth
SampleB chr1 17034387 17090909 DEL 0 clinCNV

The cov file should consist of coverage from normalised BAM files. Additionally, it is good practice to separate cov files by the technology used to sequence the FASTQ files.

Importantly, there will be four columns before the samples - ID, Chromosome, Start and End. ID, Start and End should be integers. The remaining column names will be the names of the samples - and these should match the sample names found in the cnv file.

data("test_cov")
head(test_cov[, 1:10]) %>%
  kable %>%
  kable_styling("striped", full_width=FALSE) %>%
  scroll_box(width = "800px", height = "200px")
ID CHROM START END SampleA SampleB SampleC SampleD SampleE SampleF
1 chr1 13402 13639 21.96 23.40 24.82 20.91 13.83 20.21
2 chr1 69088 69091 3.61 3.31 3.96 2.95 2.36 2.82
3 chr1 70008 70010 1.93 2.69 2.42 2.58 2.36 2.31
4 chr1 324438 325605 56.74 56.57 57.26 51.04 53.39 44.49
5 chr1 721404 721912 5.62 10.10 9.99 6.51 7.41 5.26
6 chr1 861393 861395 1.56 1.71 1.73 1.77 1.66 1.84

All samples in the cnv file should be found in the cov file, but the cov file may contain samples that are not in the cnv file.
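
A quick sanity check along these lines (a sketch, using the example data loaded above) can catch mismatched sample names before any analysis is run:

#Check that every sample named in the cnv file has a coverage column in the cov file
missing <- setdiff(unique(test_cnv$SAMPLE), colnames(test_cov))
if (length(missing) > 0) {
  stop("Samples missing from the cov file: ", paste(missing, collapse = ", "))
}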

The SARC package relies on a RaggedExperiment object to store the cov file as its assay, and stores lists of dataframes and GRanges objects within the metadata. As it runs, the SARC tool generates further cnv dataframes with additional statistics for each CNV in the list of CNVs. These dataframes are stored within the first list in the metadata ("CNVlist").

For cohorts with multiple coverage files (from multiple sequencing platforms or meta-analysis studies), we recommend creating multiple RaggedExperiment objects rather than combining them all into one.

#Create a start site and end site for each CNV detected
SARC <- regionSet(cnv = test_cnv, cov = test_cov)

#Create smaller coverage files for each CNV
SARC <- regionSplit(RE = SARC, #the RaggedExperiment object we made
                    cnv = metadata(SARC)[['CNVlist']][[1]], 
                    startlist = metadata(SARC)[[2]], #list of start sites, per CNV
                    endlist = metadata(SARC)[[3]]) #list of end sites, per CNV

Statistical analysis

First we use mean scores to check whether the mean read depth at each CNV region agrees with the CNV value reported by the detection tool. This will make a new cnv dataframe, which is stored in the RE object.

#Calculate the mean read coverage
SARC <- regionMean(RE = SARC, 
                   cnv = test_cnv, 
                   splitcov =  metadata(SARC)[[4]]) #list of cnv specific coverage

We then calculate the quantile distributions. We assume a true duplication will be on the higher end of the distribution (in contrast to the other samples) and a true deletion will be on the lower end. Depending on the number of samples in the cov file, the thresholds should be altered.

#Calculate the distribution of the mean reads
SARC <- regionQuantiles(RE = SARC, 
                        cnv = metadata(SARC)[['CNVlist']][[2]], #new cnv file
                        meancov = metadata(SARC)[[5]], #list of cnv specific coverage + means
                        q1 = 0.1, #lower threshold
                        q2 = 0.9) #upper threshold

ANOVA can then be used to identify whether a region with a suspected CNV has a significantly rare read depth, in contrast to all other samples. The more samples, the more powerful this test is.

#Calculate rarity of each suspected CNV - in contrast to other samples
SARC <- prepAnova(RE = SARC, 
                  start = metadata(SARC)[[2]], 
                  end = metadata(SARC)[[3]], 
                  cnv = metadata(SARC)[['CNVlist']][[3]]) #newest cnv dataframe

SARC <- anovaOnCNV(RE = SARC, 
                   cnv = metadata(SARC)[['CNVlist']][[3]], 
                   anovacov = metadata(SARC)[[8]])

head(metadata(SARC)[['CNVlist']][[4]]) %>%
  kable %>%
  kable_styling("striped", full_width=FALSE) %>%
  scroll_box(width = "800px", height = "200px")
SAMPLE CHROM START END TYPE VALUE TOOL MeanScore Qlow Qhigh ANOVA
SampleA chr1 161487763 161565521 DUP 3 exomeDepth 1.44 1 0.0927248
SampleB chr1 17034387 17090909 DEL 0 exomeDepth 0.64 0 0.0842323
SampleB chr1 17034387 17090909 DEL 0 clinCNV 0.64 0 0.0842323

Plotting

A major complaint of CNV analysis by diagnostic staff is the over-reliance on the Integrative Genomics Viewer (IGV). While a great tool, manually searching many hundreds of samples in it is tedious. The SARC package provides an alternative (but not a complete substitute) way to visualise which genes and exons are (or are not) affected by a CNV. This is also a quick way of visualising false positives without using IGV.

#prepare new objects for the CNV plots to work
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
library(Homo.sapiens) #load genome specific libraries from BioConductor
TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
txgene <- tx@unlistData

SARC <- plotCovPrep(RE = SARC,
                    cnv = metadata(SARC)[['CNVlist']][[4]], #newest cnv dataframe
                    startlist = metadata(SARC)[[2]],
                    endlist = metadata(SARC)[[3]],
                    n1 = 0, #left-padding
                    n2 = 0) #right-padding

SARC <- regionGrangeMake(RE = SARC, 
                         covprepped = metadata(SARC)[[9]])

SARC <- addExonsGenes(RE = SARC, 
                      covgranges = metadata(SARC)[[10]], #list of GRanges objects - one for each detected CNV
                      txdb = txdb, #Species specific database
                      txgene = txgene) #genes/ exons from the database

SARC <- setupCNVplot(RE = SARC, 
                     namedgranges = metadata(SARC)[[11]], #GRanges objects which have genes/exons added
                     covprepped = metadata(SARC)[[9]])
## [1] "All CNVs are fine for further evaluation"

If CNVs are very small and cannot be attributed to at least one known gene in the TxDb object, they will be too difficult to process any further. A message will state which CNVs were not associated with any genes. These CNVs should also be removed from the cnv file, so as not to cause errors downstream.
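
For example (a hedged sketch - the indices are purely illustrative, since in this run all CNVs were fine), if setupCNVplot reported that CNVs 2 and 5 could not be attributed to any gene, the same rows could be dropped from the most recent cnv dataframe before continuing:

#Illustrative only: drop CNVs that setupCNVplot reported as having no associated genes
unassigned <- c(2, 5)
cnv_clean <- metadata(SARC)[['CNVlist']][[4]][-unassigned, ]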

Each detected CNV can be plotted. Automated plotting can be done quite simply with a for-loop, and is recommended. The sample with the detected CNV is highlighted with a thin purple line. The sample with the detected CNV, the type of CNV and the value of the CNV from the detection pipeline are pasted as the subtitle.

#Plot the first detected CNV
plotCNV(cnv = metadata(SARC)[['CNVlist']][[4]],
        setup = metadata(SARC)[[12]],
        FilteredCNV = 1)

Flagging

Some clinicians would rather have the confidence level flagged than have low-confidence CNVs removed outright - so SARC will not remove any CNVs; filtering is left to the user (see the sketch after the table below).

#Use statistical analyses to flag CNVs as high or low confidence of being true
SARC <- cnvConfidence(RE = SARC, cnv = metadata(SARC)[['CNVlist']][[4]])
head(metadata(SARC)[['CNVlist']][[5]]) %>%
  kable %>%
  kable_styling("striped", full_width=FALSE) %>%
  scroll_box(width = "800px", height = "200px")
SAMPLE CHROM START END TYPE VALUE TOOL MeanScore Qlow Qhigh ANOVA CNV.SCORE CNV.TIER
SampleA chr1 161487763 161565521 DUP 3 exomeDepth 1.44 1 0.0927248 2 CONFIDENT
SampleB chr1 17034387 17090909 DEL 0 exomeDepth 0.64 0 0.0842323 0 VERY UNCONFIDENT
SampleB chr1 17034387 17090909 DEL 0 clinCNV 0.64 0 0.0842323 0 VERY UNCONFIDENT
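
Because SARC only flags CNVs, any filtering is left to the user. A minimal sketch keeping only the calls flagged as confident (using the tier label shown in the table above):

#Keep only CNVs flagged as confident; the CNV.TIER labels come from cnvConfidence
flagged <- metadata(SARC)[['CNVlist']][[5]]
confident <- flagged[flagged$CNV.TIER == "CONFIDENT", ]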

Extras

If the RaggedExperiment object is confusing, it can be explored as follows.

#Print the RaggedExperiment object - the assays store the coverage data
#Generally it is recommended to use the most recently made cnv dataframe, to keep all the additional columns
print(SARC) 
## class: RaggedExperiment 
## dim: 34987 1 
## assays(21): ID SampleA ... SampleS SampleT
## rownames: NULL
## colnames: NULL
## colData names(1): X
#This will show all the list objects. Their names should roughly correlate with the names of the parameters in the functions.
names(metadata(SARC)) 
##  [1] "CNVlist"       "REGIONSTART"   "REGIONEND"     "SPLITCOV"     
##  [5] "SPLITMEAN"     "LOWERQAUNTILE" "UPPERQAUNTILE" "ANOVACOV"     
##  [9] "COVPREPPED"    "COVGRANGES"    "NAMEDGRANGES"  "CNVPLOTLIST"

Plot the distributions to see why some CNVs were false positives. The sample with the suspected CNV will be highlighted in red. In cases where many samples are present, plotly = TRUE may give a clearer visual. The sample argument refers to the CNV (the row of the cnv file), so it can easily be looped over.

SARC <- setQDplot(RE = SARC, meancov = metadata(SARC)[[5]])
seeDist(meanList = metadata(SARC)[[13]],
        cnv = metadata(SARC)[['CNVlist']][[5]], 
        sample = 1,
        plotly=FALSE)
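
Since sample indexes the rows of the cnv file, the distribution plots can be generated for every CNV in a short loop (a sketch, reusing the objects created above):

#Loop over all CNVs and print each distribution plot
cnvs <- metadata(SARC)[['CNVlist']][[5]]
for (i in seq_len(nrow(cnvs))) {
  print(seeDist(meanList = metadata(SARC)[[13]], cnv = cnvs, sample = i, plotly = FALSE))
}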

Add all the genes and exons that the CNVs affect. This could be useful for identifying whether the variant contributes to a patient's symptoms.

SARC <- pasteExonsGenes(RE = SARC, 
                        setup =  metadata(SARC)[[12]], #list of dataframes from the setupCNVplot function
                        cnv = metadata(SARC)[['CNVlist']][[5]]) #cnv file to add an extra column to
                      
metadata(SARC)[['CNVlist']][[6]] %>%
  kable %>%
  kable_styling("striped", full_width=FALSE) %>%
  scroll_box(width = "800px", height = "200px")
SAMPLE CHROM START END TYPE VALUE TOOL MeanScore Qlow Qhigh ANOVA CNV.SCORE CNV.TIER Gene.Exon
SampleA chr1 161487763 161565521 DUP 3 exomeDepth 1.44 1 0.0927248 2 CONFIDENT FCGR2A.1|FCGR2A.6|FCGR2A.7|FCGR3A.1
SampleB chr1 17034387 17090909 DEL 0 exomeDepth 0.64 0 0.0842323 0 VERY UNCONFIDENT SDHB.3|SDHB.1|PADI2.9|PADI2.8|PADI2.7|PADI2.6
SampleB chr1 17034387 17090909 DEL 0 clinCNV 0.64 0 0.0842323 0 VERY UNCONFIDENT SDHB.3|SDHB.1|PADI2.9|PADI2.8|PADI2.7|PADI2.6

A more powerful test is Dunnett's test. This contrasts the read depths between the control samples (samples with suspected CNVs) and the test samples (samples without suspected CNVs) at the same region of DNA. However, this is very slow, and it is only recommended when there are few samples (<100) to test.

SARC <- phDunnetonCNV(RE = SARC, 
                      cnv = metadata(SARC)[['CNVlist']][[4]], 
                      anovacov = metadata(SARC)[[8]])
SARC <- cnvConfidence(RE = SARC, 
                      cnv = metadata(SARC)[['CNVlist']][[7]], #the cnv dataframe appended by phDunnetonCNV
                      ph = TRUE)

plotCNV can also hone in on a single GENE of interest (so long as it matches the genes found via the TxDb), can be made specific to a BATCH of WES/WGS data or to the sites where the DNA was sequenced, and can be plotted on a log scale. For WES data, logging is recommended because, across a short region of DNA, the reads can change greatly depending on the chemistry of the sequencer. For deletions, log10 is helpful; for duplications it can actually make them harder to see. The number of samples plotted can also be adjusted.

The plots can be automated easily in a loop. Trick: make one plot in RStudio, keep it in the plotting pane, and adjust the height and width. All other plots made in a loop with ggsave will then be made at the same height and width - useful when making plots for papers.

data("test_cnv2")
library(ggplot2)
for (i in seq_len(nrow(test_cnv2))) {
  n <- paste0(cnv$SAMPLE[i], "_",
              cnv$GENE[i], "_",
              cnv$CHROM[i], "_",
              cnv$START[i], "_",
              cnv$END[i])
  savename <- paste0(n, ".jpeg")
  print(i)
  plotCNV(cnv = cnv, setup =  metadata(X)[[9]], FilteredCNV = i, 
          batch = cnv$BATCH[i], gene = cnv$GENE[i])
  ggplot2::ggsave(filename = savename)
}

Alternate species

Though this tool was designed for clinical work, it can also be applied to research projects that focus on other species. Simply change the txdb and species database. For example, for mouse-based projects, follow these steps.

require(TxDb.Mmusculus.UCSC.mm10.knownGene)
## Loading required package: TxDb.Mmusculus.UCSC.mm10.knownGene
require(Mus.musculus) # load genome specific libraries from BioConductor
## Loading required package: Mus.musculus
## Loading required package: org.Mm.eg.db
## 
#prepare new objects for the CNV plots to work
TxDb(Mus.musculus) <- TxDb.Mmusculus.UCSC.mm10.knownGene
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
tx <- transcriptsBy(Mus.musculus, columns = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
txgene <- tx@unlistData

Obtaining Read Depth

Using the Bioconductor package GenomicAlignments, users can read the coverage of BAM files. Here we give an example of how to read coverage from a BAM file into R; further detail is available in the GenomicAlignments documentation.

require(GenomicAlignments)
bam <- system.file("extdata", "test.bam", package="SARC")
coverage <- coverage(bam)
coverage$chr1@values
##  [1] 0 1 0 1 2 3 2 1 0 1 2 3 2 1 0 1 2 1 0
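
To obtain the per-region read depths that the cov file requires, the per-base coverage can be averaged over the sequencing target regions. Below is a minimal sketch; the "targets.bed" file of WES capture regions is hypothetical, and its chromosome names must match the names of the coverage RleList (binnedAverage requires the seqlevels to be identical).

#Average the per-base coverage over (hypothetical) target regions
library(rtracklayer)
targets <- import("targets.bed")                        #GRanges of capture/target regions
targets <- binnedAverage(targets, coverage, "meanCov")  #mean depth per region
head(as.data.frame(targets))                            #chromosome/start/end plus meanCov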

Normalising by Library size

It is highly recommended to normalise the read depth by the total library size of the sequencing file for more accurate analysis. Calculating RPM (reads per million) is a straightforward method for genomic data. Here is an example using raw read depths and total library sizes.

load(system.file("extdata", "test_cov_raw.rda", package="SARC"))
load(system.file("extdata", "librarySizes.rda", package="SARC"))

normalised_counts <- t(t(test_cov[, 5:ncol(test_cov)]) / libsizes$PerMillion)
normalised_counts <- cbind(test_cov[,1:4], normalised_counts)
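
If per-million library sizes are not already available, they can be derived from the BAM files themselves. A sketch using Rsamtools (assuming the BAM is coordinate-sorted and indexed; here we reuse the example BAM from above):

#Total mapped reads, expressed in millions, for RPM scaling
library(Rsamtools)
stats <- idxstatsBam(bam)                    #mapped read counts per chromosome
libsize_million <- sum(stats$mapped) / 1e6   #divide raw depths by this value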

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] GenomicAlignments_1.43.0                 
##  [2] Rsamtools_2.23.1                         
##  [3] Biostrings_2.75.3                        
##  [4] XVector_0.47.0                           
##  [5] SummarizedExperiment_1.37.0              
##  [6] MatrixGenerics_1.19.0                    
##  [7] matrixStats_1.4.1                        
##  [8] Mus.musculus_1.3.1                       
##  [9] org.Mm.eg.db_3.20.0                      
## [10] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0
## [11] Homo.sapiens_1.3.1                       
## [12] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2  
## [13] org.Hs.eg.db_3.20.0                      
## [14] GO.db_3.20.0                             
## [15] OrganismDbi_1.49.0                       
## [16] TxDb.Hsapiens.UCSC.hg38.knownGene_3.20.0 
## [17] GenomicFeatures_1.59.1                   
## [18] AnnotationDbi_1.69.0                     
## [19] Biobase_2.67.0                           
## [20] kableExtra_1.4.0                         
## [21] knitr_1.49                               
## [22] SARC_1.5.0                               
## [23] RaggedExperiment_1.31.1                  
## [24] GenomicRanges_1.59.1                     
## [25] GenomeInfoDb_1.43.2                      
## [26] IRanges_2.41.2                           
## [27] S4Vectors_0.45.2                         
## [28] BiocGenerics_0.53.3                      
## [29] generics_0.1.3                           
## [30] rmarkdown_2.29                           
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.4.2           BiocIO_1.17.1           bitops_1.0-9           
##   [4] filelock_1.0.3          tibble_3.2.1            cellranger_1.1.0       
##   [7] graph_1.85.0            XML_3.99-0.17           lifecycle_1.0.4        
##  [10] httr2_1.0.7             Rdpack_2.6.2            lattice_0.22-6         
##  [13] MASS_7.3-61             magrittr_2.0.3          plotly_4.10.4          
##  [16] sass_0.4.9              jquerylib_0.1.4         yaml_2.3.10            
##  [19] plotrix_3.8-4           qqconf_1.3.2            sn_2.1.1               
##  [22] gld_2.6.6               DBI_1.2.3               buildtools_1.0.0       
##  [25] RColorBrewer_1.1-3      multcomp_1.4-26         abind_1.4-8            
##  [28] zlibbioc_1.52.0         expm_1.0-0              purrr_1.0.2            
##  [31] RCurl_1.98-1.16         TH.data_1.1-2           rappdirs_0.3.3         
##  [34] sandwich_3.1-1          GenomeInfoDbData_1.2.13 maketools_1.3.1        
##  [37] TFisher_0.2.0           svglite_2.1.3           codetools_0.2-20       
##  [40] DelayedArray_0.33.3     xml2_1.3.6              tidyselect_1.2.1       
##  [43] UCSC.utils_1.3.0        farver_2.1.2            BiocFileCache_2.15.0   
##  [46] mathjaxr_1.6-0          jsonlite_1.8.9          multtest_2.63.0        
##  [49] e1071_1.7-16            survival_3.8-3          systemfonts_1.1.0      
##  [52] tools_4.4.2             progress_1.2.3          DescTools_0.99.58      
##  [55] Rcpp_1.0.13-1           glue_1.8.0              BiocBaseUtils_1.9.0    
##  [58] mnormt_2.1.1            gridExtra_2.3           SparseArray_1.7.2      
##  [61] xfun_0.49               metap_1.11              dplyr_1.1.4            
##  [64] withr_3.0.2             numDeriv_2016.8-1.1     BiocManager_1.30.25    
##  [67] fastmap_1.2.0           boot_1.3-31             digest_0.6.37          
##  [70] R6_2.5.1                colorspace_2.1-1        biomaRt_2.63.0         
##  [73] RSQLite_2.3.9           tidyr_1.3.1             data.table_1.16.4      
##  [76] rtracklayer_1.67.0      class_7.3-22            prettyunits_1.2.0      
##  [79] httr_1.4.7              htmlwidgets_1.6.4       S4Arrays_1.7.1         
##  [82] pkgconfig_2.0.3         gtable_0.3.6            Exact_3.3              
##  [85] blob_1.2.4              sys_3.4.3               htmltools_0.5.8.1      
##  [88] RBGL_1.83.0             plyranges_1.27.0        scales_1.3.0           
##  [91] tidyverse_2.0.0         lmom_3.2                png_0.1-8              
##  [94] rstudioapi_0.17.1       reshape2_1.4.4          rjson_0.2.23           
##  [97] curl_6.0.1              proxy_0.4-27            cachem_1.1.0           
## [100] zoo_1.8-12              stringr_1.5.1           rootSolve_1.8.2.4      
## [103] parallel_4.4.2          restfulr_0.0.15         pillar_1.10.0          
## [106] grid_4.4.2              vctrs_0.6.5             dbplyr_2.5.0           
## [109] evaluate_1.0.1          mvtnorm_1.3-2           cli_3.6.3              
## [112] compiler_4.4.2          rlang_1.1.4             crayon_1.5.3           
## [115] mutoss_0.1-13           labeling_0.4.3          plyr_1.8.9             
## [118] forcats_1.0.0           stringi_1.8.4           viridisLite_0.4.2      
## [121] BiocParallel_1.41.0     txdbmaker_1.3.1         munsell_0.5.1          
## [124] lazyeval_0.2.2          Matrix_1.7-1            hms_1.1.3              
## [127] bit64_4.5.2             ggplot2_3.5.1           KEGGREST_1.47.0        
## [130] haven_2.5.4             rbibutils_2.3           memoise_2.0.1          
## [133] bslib_0.8.0             bit_4.5.0.1             readxl_1.4.3