The R package BOBaFIT comprises four functions for refitting and recalibrating the copy number (CN) profiles of tumor samples. In particular, the package was built to check, and if necessary correct, the diploid region. A wrongly assigned diploid region is a problem that frequently affects the profiles of samples with a very complex karyotype.
The principal refitting function is DRrefit, which recalibrates the copy number
values by means of a chromosome clustering method and a list of unaltered
chromosomes (the chromosome list). BOBaFIT also contains three secondary
functions: ComputeNormalChromosome, which generates the chromosome list;
PlotChrCluster, which visualizes the chromosome clusters; and Popeye, which
assigns each segment to its chromosomal arm (see the “Data Preparation” vignette).
The package checks the diploid region assessment starting from
pre-estimated segment information, such as the copy number of each segment
and its position. We include the data set TCGA_BRCA_CN_segments, which
contains all the information required. The data correspond to the segments
of about 100 breast tumor samples obtained from the TCGA-BRCA project
(Tomczak, Czerwińska, and Wiznerowicz 2015). The “Data Preparation” vignette
shows how we downloaded and prepared the data set for the following analysis.
Once the data set has been prepared, the next step is to generate the
chromosome list. The chromosome list is a vector containing the
chromosomal arms that are least affected by SCNAs in the tumor type
analyzed. Together with the clustering, the chromosome list is one of the
operating principles used to re-estimate the diploid region. The list can be
created manually or with the function ComputeNormalChromosome. We suggest
these two sequential steps to allow the correct refit and recalibration of
a sample’s diploid region:
ComputeNormalChromosome()
DRrefit()
Here we apply this analysis workflow to the data set TCGA_BRCA_CN_segments
described above.
The chromosome list is a vector specific to each tumor (type and
subtype). The chromosome arms included in this list must be selected
based on how many CNA events they undergo and how often their CN falls
within a “diploid range”. ComputeNormalChromosome writes the chromosome
list according to this principle. The function lets the user set the
chromosomal alteration rate (tolerance_val), which corresponds to the
maximum percentage of alterations that one wants to tolerate per arm.
With a small data set (fewer than 200 samples), we suggest an alteration rate of 5% (0.05); with a large data set (about 1000 samples), we suggest a maximum rate of 20-25% (0.20-0.25). The function input is a sample cohort with its segments.
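The selection principle described above can be illustrated with a minimal base R sketch. It is not the implementation of ComputeNormalChromosome: the column names (ID, arm, CN), the toy values, and the bounds of the “diploid range” are assumptions made only for this example.

```r
# Toy segment table; the column names (ID, arm, CN) are assumptions
# for this sketch, not the package's required input format.
segs <- data.frame(
  ID  = rep(c("s1", "s2", "s3", "s4"), each = 3),
  arm = rep(c("1q", "2p", "8q"), times = 4),
  CN  = c(3.1, 2.0, 2.9,
          2.8, 1.9, 2.1,
          3.3, 2.1, 3.0,
          2.0, 2.0, 2.7)
)

# Hypothetical "diploid range": a CN outside it counts as an alteration.
diploid_range <- c(1.6, 2.4)
tolerance_val <- 0.25

altered <- segs$CN < diploid_range[1] | segs$CN > diploid_range[2]

# Fraction of samples in which each arm is altered.
alteration_rate <- tapply(altered, segs$arm, mean)

# Keep the arms whose alteration rate is under the tolerance.
chr_list <- names(alteration_rate)[alteration_rate < tolerance_val]

print(alteration_rate)
print(chr_list)
```

In this toy cohort only one arm stays inside the assumed diploid range in every sample, so it is the only arm retained in the list.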
Here we ran the function on the data set TCGA_BRCA_CN_segments, using an
alteration rate of 25%.
[1] “10q” “12q” “15q” “2p” “2q” “3p” “4q” “9q”
When the result is stored in the variable chr_list, it is a vector
containing the chromosomal arms whose alteration rate is under the
indicated tolerance_val value.
The function also plots in the Viewer a histogram showing the alteration rate of each chromosomal arm and which arms have been selected for the chromosome list (blue bars). The tolerance value was set at 0.25 (dotted line).
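A plot of this kind can be approximated with base graphics. The sketch below is only an illustration under assumed values: the alteration rates are hypothetical, and the colors and layout are not those of the package's own plot.

```r
# Hypothetical per-arm alteration rates (not taken from the package output).
alteration_rate <- c("1q" = 0.75, "2p" = 0.10, "4q" = 0.20, "8q" = 0.60)
tolerance_val <- 0.25

# Arms under the tolerance are highlighted in blue, the others in grey.
selected <- alteration_rate < tolerance_val
bar_cols <- ifelse(selected, "steelblue", "grey70")

barplot(alteration_rate, col = bar_cols, ylim = c(0, 1),
        xlab = "Chromosomal arm", ylab = "Alteration rate")
abline(h = tolerance_val, lty = "dotted")  # tolerance threshold
```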
[Figure: histogram of the chromosomal alteration rate per arm (chrlist plot)]
[Figures: DRrefit plot 1 and DRrefit plot 2]
Another accessory function is PlotChrCluster. It can be used to visualize
the chromosomal clusters in a single sample or in a sample cohort. The
input is always a table of segments in the same format as the data frame
TCGA_BRCA_CN_segments (for example, loaded from a .tsv file). The options
of the clust_method argument are the same as those of DRrefit (“ward.D”,
“ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”,
“centroid” and “kmeans”).
Cluster <- PlotChrCluster(segs = TCGA_BRCA_CN_segments,
                          clust_method = "ward.D2",
                          plot_output = TRUE)
We suggest storing the output in a variable (Cluster in this example) to
view and possibly save the data frames generated. PlotChrCluster
automatically saves the plot in the folder indicated by the variable path
passed to the argument plot_path.
In the PlotChrCluster plot, the chromosomal arms are labeled and colored
according to the cluster they belong to. The y-axis reports the arm CN.
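To make the idea of clustering chromosomal arms by CN concrete, here is a minimal sketch using `hclust` from base R with the “ward.D2” method. It is not the package's internal procedure: the number of clusters is fixed at 2 as an assumption of this example, and only six arm-level CN values (those of the example plot table in this vignette) are used.

```r
# Arm-level CN of one sample (values from the example plot table
# in this vignette; only six arms are used here).
arm_cn <- c("1p" = 1.670236, "1q" = 3.140345, "2p" = 1.906657,
            "2q" = 1.911449, "3p" = 1.996745, "3q" = 2.624881)

# Hierarchical clustering on the one-dimensional CN values.
hc <- hclust(dist(arm_cn), method = "ward.D2")

# Assumption: the number of clusters is fixed at 2 for this sketch.
groups <- cutree(hc, k = 2)

# Table with one row per arm: arm name, assigned cluster, and CN.
plot_table <- data.frame(chr = names(arm_cn),
                         cluster = paste0("cluster", groups),
                         CN = unname(arm_cn))
print(plot_table)
```

Any of the other hierarchical methods listed above can be passed to the `method` argument of `hclust` in the same way; “kmeans” would instead require `stats::kmeans`.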
The first output, report, summarizes the outcome of clustering for each
sample (failed or succeeded, and the number of clusters), similar to the
report output of DRrefit. The second output, plot_tables, is a list of
data frames (one per sample) reporting the cluster in which each
chromosomal arm of the sample has been placed.
head(Cluster$report)
#select plot table per sample
head(Cluster$plot_tables$`01428281-1653-4839-b5cf-167bc62eb147`)
sample | clustering | num_clust |
---|---|---|
01428281-1653-4839-b5cf-167bc62eb147 | SUCCEDED | 3 |
01bc5261-bf91-4f7b-a6b4-0e727c5e31d2 | SUCCEDED | 2 |
05afee4e-9acd-44f1-8a0c-ffa34d772b9c | SUCCEDED | 3 |
091f70c0-586a-49e8-a0e5-0b60caa72c1b | SUCCEDED | 3 |
0941a978-c8aa-467b-8464-9f979d1f0418 | SUCCEDED | 2 |
#select plot table per sample
knitr::kable(head(Cluster$plot_tables$`01428281-1653-4839-b5cf-167bc62eb147`))
chr | cluster | CN |
---|---|---|
1p | cluster1 | 1.670236 |
1q | cluster2 | 3.140345 |
2p | cluster1 | 1.906657 |
2q | cluster1 | 1.911449 |
3p | cluster1 | 1.996745 |
3q | cluster2 | 2.624881 |
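Because the output is a report plus a list of per-sample data frames, it can be convenient to stack the plot tables of the successfully clustered samples into one data frame. The sketch below uses a toy stand-in for the returned object: the sample IDs, the CN values, and the "FAIL" label are hypothetical, while the element and column names follow the tables above.

```r
# Toy stand-in for the returned list. Sample IDs, values and the "FAIL"
# label are hypothetical; "SUCCEDED" is spelled as in the report above.
Cluster <- list(
  report = data.frame(sample = c("s1", "s2"),
                      clustering = c("SUCCEDED", "FAIL"),
                      num_clust = c(2, NA)),
  plot_tables = list(
    s1 = data.frame(chr = c("1p", "1q"),
                    cluster = c("cluster1", "cluster2"),
                    CN = c(1.67, 3.14)),
    s2 = data.frame(chr = c("1p", "1q"),
                    cluster = c("cluster1", "cluster1"),
                    CN = c(2.01, 2.05))
  )
)

# Samples whose clustering succeeded according to the report.
ok <- Cluster$report$sample[Cluster$report$clustering == "SUCCEDED"]

# Stack their plot tables, adding a column with the sample ID.
combined <- do.call(rbind, lapply(ok, function(id) {
  cbind(sample = id, Cluster$plot_tables[[id]])
}))

# Number of arms per cluster in each retained sample.
arm_counts <- table(combined$sample, combined$cluster)
print(combined)
print(arm_counts)
```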
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 BOBaFIT_1.11.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.3
## [3] rstudioapi_0.17.1 jsonlite_1.8.9
## [5] magrittr_2.0.3 GenomicFeatures_1.59.1
## [7] farver_2.1.2 rmarkdown_2.29
## [9] BiocIO_1.17.1 zlibbioc_1.52.0
## [11] vctrs_0.6.5 memoise_2.0.1
## [13] Rsamtools_2.23.1 RCurl_1.98-1.16
## [15] base64enc_0.1-3 tinytex_0.54
## [17] htmltools_0.5.8.1 S4Arrays_1.7.1
## [19] progress_1.2.3 curl_6.0.1
## [21] SparseArray_1.7.2 Formula_1.2-5
## [23] sass_0.4.9 bslib_0.8.0
## [25] htmlwidgets_1.6.4 httr2_1.0.7
## [27] plyr_1.8.9 cachem_1.1.0
## [29] buildtools_1.0.0 GenomicAlignments_1.43.0
## [31] lifecycle_1.0.4 pkgconfig_2.0.3
## [33] Matrix_1.7-1 R6_2.5.1
## [35] fastmap_1.2.0 GenomeInfoDbData_1.2.13
## [37] MatrixGenerics_1.19.0 digest_0.6.37
## [39] colorspace_2.1-1 GGally_2.2.1
## [41] AnnotationDbi_1.69.0 S4Vectors_0.45.2
## [43] OrganismDbi_1.49.0 Hmisc_5.2-0
## [45] GenomicRanges_1.59.1 RSQLite_2.3.8
## [47] labeling_0.4.3 filelock_1.0.3
## [49] fansi_1.0.6 polyclip_1.10-7
## [51] httr_1.4.7 abind_1.4-8
## [53] compiler_4.4.2 withr_3.0.2
## [55] bit64_4.5.2 htmlTable_2.4.3
## [57] backports_1.5.0 BiocParallel_1.41.0
## [59] DBI_1.2.3 ggstats_0.7.0
## [61] ggforce_0.4.2 biomaRt_2.63.0
## [63] MASS_7.3-61 rappdirs_0.3.3
## [65] DelayedArray_0.33.2 rjson_0.2.23
## [67] tools_4.4.2 foreign_0.8-87
## [69] nnet_7.3-19 glue_1.8.0
## [71] restfulr_0.0.15 grid_4.4.2
## [73] checkmate_2.3.2 cluster_2.1.6
## [75] reshape2_1.4.4 generics_0.1.3
## [77] gtable_0.3.6 BSgenome_1.75.0
## [79] tidyr_1.3.1 ensembldb_2.31.0
## [81] hms_1.1.3 data.table_1.16.2
## [83] xml2_1.3.6 utf8_1.2.4
## [85] XVector_0.47.0 BiocGenerics_0.53.3
## [87] pillar_1.9.0 stringr_1.5.1
## [89] tweenr_2.0.3 BiocFileCache_2.15.0
## [91] lattice_0.22-6 rtracklayer_1.67.0
## [93] bit_4.5.0 biovizBase_1.55.0
## [95] RBGL_1.83.0 tidyselect_1.2.1
## [97] maketools_1.3.1 Biostrings_2.75.1
## [99] knitr_1.49 gridExtra_2.3
## [101] ggbio_1.55.0 IRanges_2.41.1
## [103] ProtGenerics_1.39.0 SummarizedExperiment_1.37.0
## [105] stats4_4.4.2 xfun_0.49
## [107] Biobase_2.67.0 matrixStats_1.4.1
## [109] stringi_1.8.4 UCSC.utils_1.3.0
## [111] lazyeval_0.2.2 yaml_2.3.10
## [113] evaluate_1.0.1 codetools_0.2-20
## [115] NbClust_3.0.1 tibble_3.2.1
## [117] graph_1.85.0 BiocManager_1.30.25
## [119] cli_3.6.3 rpart_4.1.23
## [121] munsell_0.5.1 jquerylib_0.1.4
## [123] dichromat_2.0-0.1 Rcpp_1.0.13-1
## [125] GenomeInfoDb_1.43.2 dbplyr_2.5.0
## [127] png_0.1-8 XML_3.99-0.17
## [129] parallel_4.4.2 ggplot2_3.5.1
## [131] blob_1.2.4 prettyunits_1.2.0
## [133] AnnotationFilter_1.31.0 plyranges_1.27.0
## [135] bitops_1.0-9 txdbmaker_1.3.1
## [137] VariantAnnotation_1.53.0 scales_1.3.0
## [139] purrr_1.0.2 crayon_1.5.3
## [141] rlang_1.1.4 KEGGREST_1.47.0