Data preparation using TCGA-BRCA database

Introduction

Data preparation is an important step before the BOBaFIT analysis. In this vignette we explain how to download the TCGA-BRCA dataset (Tomczak, Czerwińska, and Wiznerowicz 2015) using the package TCGAbiolinks (Colaprico et al. 2015), and how to add the information that the package relies on, namely the chromosomal arm and the CN value of each segment. We also show the column names that the input file must have for all BOBaFIT functions.

Download from TCGA

To download the TCGA-BRCA dataset (Tomczak, Czerwińska, and Wiznerowicz 2015), we used the R package TCGAbiolinks (Colaprico et al. 2015) and constructed a query. The query includes breast cancer samples analyzed with a SNP array (GenomeWide_SNP6) and retrieves their copy number (CN) profiles.

BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)

query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  sample.type = "Primary Tumor"
                  )

# Select the first 100 samples using the TCGA barcode
subset <- query[[1]][[1]]
barcode <- subset$cases[1:100]

TCGA_BRCA_CN_segments <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  sample.type = "Primary Tumor",
                  barcode = barcode
)

GDCdownload(TCGA_BRCA_CN_segments, method = "api", files.per.chunk = 50)

# Prepare a data.frame to work on
data <- GDCprepare(TCGA_BRCA_CN_segments, save = TRUE, 
           save.filename= "TCGA_BRCA_CN_segments.txt")

In the last step, a data frame with the segments of all samples is prepared. However, some information is still missing, so the dataset is not yet ready to be used as BOBaFIT input.
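
A side note on the barcode selection in the code above: indexing `cases[1:100]` pads the result with `NA` when the query returns fewer than 100 cases. A minimal sketch of a more defensive selection follows, using a made-up vector of barcodes in place of the real query results (the values in `cases` and the name `n_samples` are hypothetical):

```r
# Hypothetical barcodes standing in for the `cases` column of the query results
cases <- c("TCGA-AA-0001-01A", "TCGA-AA-0002-01A", "TCGA-AA-0002-01A",
           "TCGA-AA-0003-01A")

# Number of samples requested
n_samples <- 100

# unique() drops repeated barcodes; head() returns at most n_samples
# elements and never pads the result with NA
barcode <- head(unique(cases), n_samples)

length(barcode)
anyNA(barcode)
```

The same two calls can be applied to the `cases` column of the real query results before passing `barcode` to the second `GDCquery` call.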

Columns preparation

The columns of the data frame must be renamed to match the column names that all BOBaFIT functions expect in the input file.

names(data)
BOBaFIT_names <- c("ID", "chr", "start", "end", "Num_Probes", 
           "Segment_Mean","Sample")
names(data) <- BOBaFIT_names
names(data)
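
The assignment above renames columns by position, so it would silently mislabel the data if the column order ever differed. A minimal sketch of a name-based alternative is given here. The GDC column names in `name_map` are an assumption about what `GDCprepare` returns for this data type, and `toy` is a made-up one-row data frame, so check `names(data)` on your own download before relying on the mapping.

```r
# Toy stand-in for the GDCprepare() output (GDC column names are assumed),
# with the columns deliberately out of order
toy <- data.frame(
  Sample       = "TCGA-BH-A18R-01A-11D-A12A-01",
  Segment_Mean = -0.4756,
  Num_Probes   = 12088,
  End          = 21996664,
  Start        = 62920,
  Chromosome   = "1",
  GDC_Aliquot  = "01428281-1653-4839-b5cf-167bc62eb147",
  stringsAsFactors = FALSE
)

# Mapping: assumed GDC name -> BOBaFIT name, in the BOBaFIT column order
name_map <- c(GDC_Aliquot = "ID", Chromosome = "chr", Start = "start",
              End = "end", Num_Probes = "Num_Probes",
              Segment_Mean = "Segment_Mean", Sample = "Sample")

# Stop early if an expected column is absent
missing_cols <- setdiff(names(name_map), names(toy))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}

# Select the columns by name in the expected order, then rename them
toy <- toy[, names(name_map), drop = FALSE]
names(toy) <- unname(name_map)
names(toy)
```

Selecting by name before renaming means the result has the same column order as `BOBaFIT_names`, whatever order the input arrived in.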

Assign the chromosome arm with Popeye

The arm column is very important information: it supports the diploid region check of DRrefit and the chromosome list computation of ComputeNormalChromosome. As it is lacking in the TCGA-BRCA dataset, the function Popeye has been specially designed to calculate which chromosomal arm each segment belongs to. Thanks to this algorithm, not only the TCGA-BRCA dataset but any dataset you want to analyze can be processed by any BOBaFIT function.

library(BOBaFIT)
segments <- Popeye(data)
chr  start     end       width     strand  ID                                    Num_Probes  Segment_Mean  Sample                        arm  chrarm
1    62920     21996664  21933745  *       01428281-1653-4839-b5cf-167bc62eb147  12088       -0.4756       TCGA-BH-A18R-01A-11D-A12A-01  p    1p
1    22001786  22002025  240       *       01428281-1653-4839-b5cf-167bc62eb147  2           -7.4436       TCGA-BH-A18R-01A-11D-A12A-01  p    1p
1    22004046  22010750  6705      *       01428281-1653-4839-b5cf-167bc62eb147  2           -2.1226       TCGA-BH-A18R-01A-11D-A12A-01  p    1p
1    22011632  25256850  3245219   *       01428281-1653-4839-b5cf-167bc62eb147  1914        -0.4808       TCGA-BH-A18R-01A-11D-A12A-01  p    1p
1    25266637  25320198  53562     *       01428281-1653-4839-b5cf-167bc62eb147  22          -2.1144       TCGA-BH-A18R-01A-11D-A12A-01  p    1p
1    25320253  30316360  4996108   *       01428281-1653-4839-b5cf-167bc62eb147  2434        -0.4905       TCGA-BH-A18R-01A-11D-A12A-01  p    1p
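
To give an intuition for what an arm assignment involves, the sketch below labels toy segments by comparing their coordinates with a centromere position. It is not Popeye's implementation: the function `assign_arm`, the `centromere` table, and the chromosome 1 centromere coordinate are illustrative assumptions, and segments that span the centromere are simply labelled `NA` here.

```r
# Illustrative centromere position for chromosome 1 (approximate, assumed value)
centromere <- data.frame(chr = "1", position = 123400000,
                         stringsAsFactors = FALSE)

# Toy segments: the first two rows reuse coordinates from the table above,
# the last two are made up
toy_segments <- data.frame(
  chr   = c("1", "1", "1", "1"),
  start = c(62920, 22001786, 150000000, 120000000),
  end   = c(21996664, 22002025, 160000000, 130000000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: "p" if the segment ends before the centromere,
# "q" if it starts after it, NA if it spans it
assign_arm <- function(segments, centromere) {
  cen <- centromere$position[match(segments$chr, centromere$chr)]
  arm <- ifelse(segments$end < cen, "p",
                ifelse(segments$start > cen, "q", NA_character_))
  segments$arm <- arm
  segments$chrarm <- ifelse(is.na(arm), NA_character_,
                            paste0(segments$chr, arm))
  segments
}

assign_arm(toy_segments, centromere)
```

A real implementation also has to decide what to do with segments that cross the centromere, which this sketch deliberately leaves unresolved.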

Calculation of the Copy Number

The last step is to compute the copy number (CN) value from the “Segment_Mean” column (logR) with the formula below. Once the CN column has been added, the data are ready to be analyzed by the whole package.

# When data from a SNP array platform are used, the user has to apply the
# compression factor (0.55) in the formula. For WGS/WES data, the
# compression factor is equal to 1.
compression_factor <- 0.55
segments$CN <- 2^(segments$Segment_Mean/compression_factor + 1)
chr  start     end       width     strand  ID                                    Num_Probes  Segment_Mean  Sample                        arm  chrarm  CN
1    62920     21996664  21933745  *       01428281-1653-4839-b5cf-167bc62eb147  12088       -0.4756       TCGA-BH-A18R-01A-11D-A12A-01  p    1p      1.0983004
1    22001786  22002025  240       *       01428281-1653-4839-b5cf-167bc62eb147  2           -7.4436       TCGA-BH-A18R-01A-11D-A12A-01  p    1p      0.0001686
1    22004046  22010750  6705      *       01428281-1653-4839-b5cf-167bc62eb147  2           -2.1226       TCGA-BH-A18R-01A-11D-A12A-01  p    1p      0.1378076
1    22011632  25256850  3245219   *       01428281-1653-4839-b5cf-167bc62eb147  1914        -0.4808       TCGA-BH-A18R-01A-11D-A12A-01  p    1p      1.0911264
1    25266637  25320198  53562     *       01428281-1653-4839-b5cf-167bc62eb147  22          -2.1144       TCGA-BH-A18R-01A-11D-A12A-01  p    1p      0.1392391
1    25320253  30316360  4996108   *       01428281-1653-4839-b5cf-167bc62eb147  2434        -0.4905       TCGA-BH-A18R-01A-11D-A12A-01  p    1p      1.0778690
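
The formula can be wrapped in a small helper to make the role of the compression factor explicit. The function name `logr_to_cn` is hypothetical and not part of BOBaFIT; the sketch only restates the formula above and evaluates it on the first segment of the table and on two reference points.

```r
# Hypothetical helper: convert a logR value (Segment_Mean) to a copy number.
# compression_factor = 0.55 for SNP array data, 1 for WGS/WES data.
logr_to_cn <- function(logr, compression_factor = 0.55) {
  2^(logr / compression_factor + 1)
}

# First segment of the table above (SNP array data)
logr_to_cn(-0.4756)

# A logR of 0 corresponds to two copies, whatever the factor
logr_to_cn(0)
logr_to_cn(0, compression_factor = 1)

# Without compression, a logR of -1 corresponds to a single copy
logr_to_cn(-1, compression_factor = 1)
```

Because the function is vectorised, `logr_to_cn(segments$Segment_Mean)` would give the same values as the `segments$CN` assignment above.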

Session info

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.4      BOBaFIT_1.11.0   BiocStyle_2.33.1
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3          sys_3.4.3                  
##   [3] rstudioapi_0.17.1           jsonlite_1.8.9             
##   [5] magrittr_2.0.3              GenomicFeatures_1.57.1     
##   [7] farver_2.1.2                rmarkdown_2.28             
##   [9] BiocIO_1.15.2               zlibbioc_1.51.2            
##  [11] vctrs_0.6.5                 memoise_2.0.1              
##  [13] Rsamtools_2.21.2            RCurl_1.98-1.16            
##  [15] base64enc_0.1-3             tinytex_0.53               
##  [17] htmltools_0.5.8.1           S4Arrays_1.5.11            
##  [19] progress_1.2.3              curl_5.2.3                 
##  [21] SparseArray_1.5.45          Formula_1.2-5              
##  [23] sass_0.4.9                  bslib_0.8.0                
##  [25] htmlwidgets_1.6.4           httr2_1.0.5                
##  [27] plyr_1.8.9                  cachem_1.1.0               
##  [29] buildtools_1.0.0            GenomicAlignments_1.41.0   
##  [31] lifecycle_1.0.4             pkgconfig_2.0.3            
##  [33] Matrix_1.7-1                R6_2.5.1                   
##  [35] fastmap_1.2.0               GenomeInfoDbData_1.2.13    
##  [37] MatrixGenerics_1.17.1       digest_0.6.37              
##  [39] colorspace_2.1-1            GGally_2.2.1               
##  [41] AnnotationDbi_1.69.0        S4Vectors_0.43.2           
##  [43] OrganismDbi_1.47.0          Hmisc_5.2-0                
##  [45] GenomicRanges_1.57.2        RSQLite_2.3.7              
##  [47] labeling_0.4.3              filelock_1.0.3             
##  [49] fansi_1.0.6                 polyclip_1.10-7            
##  [51] httr_1.4.7                  abind_1.4-8                
##  [53] compiler_4.4.1              withr_3.0.2                
##  [55] bit64_4.5.2                 htmlTable_2.4.3            
##  [57] backports_1.5.0             BiocParallel_1.39.0        
##  [59] DBI_1.2.3                   ggstats_0.7.0              
##  [61] ggforce_0.4.2               biomaRt_2.61.3             
##  [63] MASS_7.3-61                 rappdirs_0.3.3             
##  [65] DelayedArray_0.31.14        rjson_0.2.23               
##  [67] tools_4.4.1                 foreign_0.8-87             
##  [69] nnet_7.3-19                 glue_1.8.0                 
##  [71] restfulr_0.0.15             grid_4.4.1                 
##  [73] checkmate_2.3.2             cluster_2.1.6              
##  [75] reshape2_1.4.4              generics_0.1.3             
##  [77] gtable_0.3.6                BSgenome_1.73.1            
##  [79] tidyr_1.3.1                 ensembldb_2.29.1           
##  [81] hms_1.1.3                   data.table_1.16.2          
##  [83] xml2_1.3.6                  utf8_1.2.4                 
##  [85] XVector_0.45.0              BiocGenerics_0.51.3        
##  [87] pillar_1.9.0                stringr_1.5.1              
##  [89] tweenr_2.0.3                BiocFileCache_2.13.2       
##  [91] lattice_0.22-6              rtracklayer_1.65.0         
##  [93] bit_4.5.0                   biovizBase_1.53.0          
##  [95] RBGL_1.81.0                 tidyselect_1.2.1           
##  [97] maketools_1.3.1             Biostrings_2.73.2          
##  [99] knitr_1.48                  gridExtra_2.3              
## [101] ggbio_1.53.0                IRanges_2.39.2             
## [103] ProtGenerics_1.37.1         SummarizedExperiment_1.35.5
## [105] stats4_4.4.1                xfun_0.48                  
## [107] Biobase_2.65.1              matrixStats_1.4.1          
## [109] stringi_1.8.4               UCSC.utils_1.1.0           
## [111] lazyeval_0.2.2              yaml_2.3.10                
## [113] evaluate_1.0.1              codetools_0.2-20           
## [115] NbClust_3.0.1               tibble_3.2.1               
## [117] graph_1.83.0                BiocManager_1.30.25        
## [119] cli_3.6.3                   rpart_4.1.23               
## [121] munsell_0.5.1               jquerylib_0.1.4            
## [123] dichromat_2.0-0.1           Rcpp_1.0.13                
## [125] GenomeInfoDb_1.41.2         dbplyr_2.5.0               
## [127] png_0.1-8                   XML_3.99-0.17              
## [129] parallel_4.4.1              ggplot2_3.5.1              
## [131] blob_1.2.4                  prettyunits_1.2.0          
## [133] AnnotationFilter_1.31.0     plyranges_1.25.0           
## [135] bitops_1.0-9                txdbmaker_1.1.2            
## [137] VariantAnnotation_1.51.2    scales_1.3.0               
## [139] purrr_1.0.2                 crayon_1.5.3               
## [141] rlang_1.1.4                 KEGGREST_1.45.1

References

Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. 2015. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research 44 (8): e71–71. https://doi.org/10.1093/nar/gkv1507.
Tomczak, Katarzyna, Patrycja Czerwińska, and Maciej Wiznerowicz. 2015. “Review the Cancer Genome Atlas (TCGA): An Immeasurable Source of Knowledge.” Współczesna Onkologia 1A: 68–77. https://doi.org/10.5114/wo.2014.47136.