BgeeCall
is a
collection of functions that uses Bgee
expertise to create RNA-Seq gene expression present/absent calls.
The BgeeCall
package allows to:
If you find a bug or have any issues with BgeeCall
please write a bug report in our GitHub issues
manager.
In Bgee RNA-Seq calls are generated using a threshold specific to each RNA-Seq library, calculated using reads mapped to reference intergenic regions. This is unlike the more usual use of an arbitrary threshold below which a gene is not considered as expressed (e.g log2(TPM) = 1).
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species and produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It notably integrates RNA-Seq libraries for 29 species.
Reference intergenic regions are defined in the Bgee RNA-Seq pipeline. Candidate intergenic regions are defined using gene annotation data. For each species, over all available libraries, reads are mapped to these intergenic regions with kallisto, as well as to genes. This “intergenic expression” is deconvoluted to distinguish reference intergenic from non annotated genes, which have higher expression. Reference intergenic regions are then defined as intergenic regions with low expression level over all RNA-Seq libraries, relative to genes. This step allows not to consider regions wrongly considered as intergenic because of potential gene annotation quality problem as intergenic. For more information please refer to the Bgee RNA-Seq pipeline.
BgeeCall pipeline allows to download reference intergenic regions resulting from the expertise of the Bgee team. Moreover BgeeCall allows to use these reference intergenic regions to automatically generate calls for your own RNA-Seq libraries as long as the species is integrated to Bgee.
By default BgeeCall calculate a pValue to define calls. By default genes are consider present if the pValue is lower or equal to 0.05. More information on this pValue and potential other approaches to generate the calls are available here
In R:
BgeeCall is highly tunable. Do not hesitate to have a look at the reference manual to have a precise descripton of all slots of the 4 main S4 classes (AbundanceMetadata, KallistoMetadata, BgeeMetadata and UserMetadata) or of all available functions. BgeeCall needs kallisto to run. If you do not have kallisto installed you will find more information how to install it here
With the BgeeCall package it is easy to generate present/absent gene expression calls. Generation of the kallisto transcriptome index can require a lot of time. As the time needed for this step depend on the size of the transcriptome we choose, as an example, the smallest transcriptome file among all species available on Bgee (C. elegans). To generate these calls you will need :
For this vignette we created a toy fastq file example based on the SRX099901 RNA-Seq library using the ShortRead R package
library("ShortRead")
# keep 48.000 reads
sampler <- FastqSampler(file.path("absolute_path","/SRX099901/SRR350955.fastq.gz"), 48000)
set.seed(1); SRR350955 <- yield(sampler)
writeFastq(object = SRR350955,
file =file.path( "absolute_path","SRX099901_subset", "SRR350955_subset.fastq.gz"),
mode = "w", full = FALSE, compress = TRUE)
In this example we used the Bioconductor AnnotationHub to load transcriptome and gene annotations but you can load them from wherever you want.
library(AnnotationHub)
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Loading required package: BiocFileCache
#> Loading required package: dbplyr
library(rtracklayer)
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#>
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:AnnotationHub':
#>
#> hubUrl
ah <- AnnotationHub()
ah_resources <- query(ah, c("Ensembl", "Caenorhabditis elegans", "84"))
annotation_object <- ah_resources[["AH50789"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Importing File into R ..
# remove MtDNA not tag as C. elegans genome
annotation_object <- dropSeqlevels(annotation_object, "MtDNA", "coarse")
transcriptome_object <- import.2bit(ah_resources[["AH50453"]])
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
Once you have access to transcriptome, gene annotations and your
RNA-Seq library, an object of class UserMetadata
has to be
created.
# create an object of class UserMetadata and specify the species ID
user_BgeeCall <- new("UserMetadata", species_id = "6239")
# import annotation and transcriptome in the user_BgeeCall object
# it is possible to import them using an S4 object (GRanges, DNAStringSet) or a file (gtf, fasta)
user_BgeeCall <- setAnnotationFromObject(user_BgeeCall, annotation_object, "WBcel235_84")
user_BgeeCall <- setTranscriptomeFromObject(user_BgeeCall, transcriptome_object, "WBcel235")
# provide path to the directory of your RNA-Seq library
user_BgeeCall <- setRNASeqLibPath(user_BgeeCall,
system.file("extdata",
"SRX099901_subset",
package = "BgeeCall"))
Note that in BgeeCall is possible to specify the annotation source. The default source is ensembl, but gencode files can also be used, by specifying that in the attribute gtf_source in the class UserMetadata.
And that’s it… You can run the generation of your present/absent gene expression calls
#> Querying Bgee to get intergenic release information...
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Note: importing `abundance.h5` is typically faster than `abundance.tsv`
#> reading in files with read.delim (install 'readr' package for speed up)
#> 1
#> summarizing abundance
#> summarizing counts
#> summarizing length
#> Generate present/absent expression calls using pValuecutoff
Each analyze generates 5 files and return path to each one of them.
head(read.table(calls_output$calls_tsv_path, header = TRUE), n = 5)
#> id abundance counts length biotype type zScore
#> 1 III_10079208_10080946 0 0 1560 <NA> intergenic NA
#> 2 III_1011875_1015179 0 0 3126 <NA> intergenic NA
#> 3 III_1016637_1020469 0 0 3654 <NA> intergenic NA
#> 4 III_10289112_10294636 0 0 5346 <NA> intergenic NA
#> 5 III_10311601_10314095 0 0 2316 <NA> intergenic NA
#> pValue call
#> 1 NA absent
#> 2 NA absent
#> 3 NA absent
#> 4 NA absent
#> 5 NA absent
read.table(calls_output$cutoff_info_file_path)
#> V1 V2
#> 1 libraryId SRX099901_subset
#> 2 cutoffTPM 27.1747
#> 3 proportionGenicPresent 17.8245677723828
#> 4 numberGenicPresent 3928
#> 5 numberGenic 22037
#> 6 proportionCodingPresent 19.1128282877684
#> 7 numberPresentCoding 3908
#> 8 numberCoding 20447
#> 9 proportionIntergenicPresent 0.0233808744447042
#> 10 numberIntergenicPresent 1
#> 11 numberIntergenic 4277
#> 12 pValueCutoff 0.05
#> 13 meanIntergenic 5.98186903584813
#> 14 sdIntergenic 2.50920239407257
head(read.table(calls_output$abundance_tsv, header = TRUE), n = 5)
#> target_id length eff_length est_counts tpm
#> 1 Y110A7A.10 1787 1556.20 2 27.0993
#> 2 F27C8.1 1940 1761.00 0 0.0000
#> 3 F07C3.7 1728 1549.00 0 0.0000
#> 4 F52H2.2 1739 1519.92 1 13.8730
#> 5 T13A10.10a 1734 1555.00 0 0.0000
read.table(calls_output$S4_slots_summary, header = TRUE, sep = "\t")
#> Slot.name
#> 1 AbundanceMetadata@tool_name
#> 2 AbundanceMetadata@txOut
#> 3 AbundanceMetadata@ignoreTxVersion
#> 4 AbundanceMetadata@cutoff
#> 5 AbundanceMetadata@read_size_kmer_threshold
#> 6 BgeeMetadata@intergenic_release
#> 7 UserMetadata@species_id
#> 8 UserMetadata@reads_size
#> 9 UserMetadata@rnaseq_lib_path
#> 10 UserMetadata@transcriptome_name
#> 11 UserMetadata@annotation_name
#> 12 UserMetadata@simple_arborescence
#> 13 output_dir
#> Slot.value
#> 1 kallisto
#> 2 FALSE
#> 3 FALSE
#> 4 0.05
#> 5 50
#> 6 1.0
#> 7 6239
#> 8 51
#> 9 /tmp/Rtmp0g2e0Y/Rinst19457fc035de/BgeeCall/extdata/SRX099901_subset
#> 10 WBcel235
#> 11 WBcel235_84
#> 12 TRUE
#> 13 /tmp/Rtmp0g2e0Y/Rinst19457fc035de/BgeeCall/extdata/intergenic_1.0/all_results/SRX099901_subset
Note that in the pValue approach (default method to generate present/absent gene expression calls) the calls file as well as the cutoff_info_file and the TPM_distribution files not contain extension name of the approach used. On the other hand, all the other approaches contain the approach name in the file names.
You will potentialy be also interested to generate present/absent
calls on different RNA-Seq libraries, potentially on different species,
or using The main function generate_presence_absence()
allows to generate present/absent calls from a UserMetadata object but
also from a data frame or a tsv file depending on the arguments of the
function you use. Please choose one of the three following arguments : -
userMetadata : Allows to generate present/absent calls for one RNA-Seq
library using one object of the class UserMetadata.
- userDataFrame : Provide a dataframe where each row correspond to one
present/absent call generation. It allows to generate present/absent
calls on different libraries, species, transcriptome, genome
annotations, etc. - userFile : Similar to userDataFrame except that the
information are stored in a tsv file. A template of this file called
userMetadataTemplate.tsv
is available at the root of the
package.
Columns of the dataframe or the tsv file are :
getwd()
function and correspond to the working
directory of your R session. If not interested by this option, leave the
column empty.working_path
column. However, this column allows you to
define a different output_directory for RNA-Seq results. For instance it
allows you to save calls information directly in the RNA-Seq directory.
If not interested by this option, leave the column empty.Once the file has been fill in expression calls can be generated with :
Generate expression calls for each RNA-Seq libraries
Different releases of reference intergenic sequences are available. It is possible to list all these releases :
list_intergenic_release()
#> Downloading release information of reference intergenic sequences...
#> release releaseDate FTPURL
#> 1 1.0 2021-06-11 https://www.bgee.org/ftp/intergenic/1.0/
#> 2 0.2 2019-02-07 https://www.bgee.org/ftp/intergenic/0.2/
#> 3 0.1 2018-12-21 https://www.bgee.org/ftp/intergenic/0.1/
#> 4 community 2019-07-22
#> 5 custom 2019-07-22
#> referenceIntergenicFastaURL
#> 1 https://www.bgee.org/ftp/intergenic/1.0/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 2 https://www.bgee.org/ftp/intergenic/0.2/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 3 https://www.bgee.org/ftp/intergenic/0.1/ref_intergenic/SPECIES_ID_intergenic.fa.gz
#> 4
#> 5
#> minimumVersionBgeeCall
#> 1 0.9.9
#> 2 0.9.9
#> 3 0.9.9
#> 4 1.1.0
#> 5 1.1.0
#> description
#> 1 intergenic regions used to generate Bgee 15.
#> 2 cleaned intergenic sequences based on release 0.1 (remove blocks of Ns longer than 100 and sequences containing more than 5% of Ns).
#> 3 intergenic regions used to generate Bgee 14.
#> 4 Release allowing to access to all reference intergenic sequences generated by the community and not present in Bgee for the moment.
#> 5 Release allowing to use your own FASTA reference intergenic sequences. When this release is selected BgeeCall will use the sequences at UserMetadata@custom_intergenic_path to generate present/absent calls.
#> messageToUsers
#> 1
#> 2 be careful, this intergenic release has not been tested by Bgee
#> 3
#> 4 These reference intergenic sequences have not been generated by Bgee. Use with caution.
#> 5 You decided to use your own reference intergenic sequences
It is then possible to choose one specific release to create a
BgeeMetadata
object. Always use the setter method
setIntergenicRelease()
when changing the release of an
already existing BgeeMetadata object.
# create BgeeMetadata object and define one reference intergenic release
bgee <- new("BgeeMetadata", intergenic_release = "0.1")
#> Querying Bgee to get intergenic release information...
# change the reference intergenic release of your BgeeMetadata object
bgee <- setIntergenicRelease(bgee, "0.2")
#> IMPORTANT : be careful, this intergenic release has not been tested by Bgee
By default the reference intergenic release used when a
BgeeMetadata
object is created is the last stable one
created by the Bgee team.
Core reference intergenic
releases are created by the
Bgee team when a lot of new RNA-Seq libraries have been manually curated
for already existing species and/or for new species. These releases are
the only ones with a release number (e.g “0.1”). Each of these releases
contains reference intergenic sequences for a list of species. Bgee
reference intergenic sequences have been generated using Bgee team
expertise. The RNA-Seq libraries were manually curated as healthy and
wild type. Quality Control have been done along all steps of generation
of these sequences. Reference intergenic sequences have been selected
from all potential intergenic regions (see Bgee
pipeline). BgeeCall allows to generate gene expression call from
Bgee reference intergenic sequences for any RNA-Seq libraries as long as
these sequences have been generated by the Bgee team. A tsv file
containing all species available for current release of reference
intergenic is available here.
This file also contains a column describing the number of RNA-Seq
libraries used to generated the reference intergenic sequences of each
species. It is also possible to list in R all species for which Bgee
reference intergenic sequences have been created :
list_bgee_ref_intergenic_species(myBgeeMetadata = bgee)
#> speciesId speciesName numberOfLibraries genomeVersion
#> 1 9606 Homo sapiens 5026 GRCh38.p5
#> 2 10090 Mus musculus 133 GRCm38.p4
#> 3 9544 Macaca mulatta 90 MMUL1.0
#> 4 7955 Danio rerio 67 GRCz10
#> 5 8364 Xenopus tropicalis 66 JGI4.2
#> 6 6239 Caenorhabditis elegans 50 WBcel235
#> 7 9031 Gallus gallus 45 Galgal4
#> 8 10116 Rattus norvegicus 36 Rnor_6.0
#> 9 9913 Bos taurus 33 UMD3.1
#> 10 13616 Monodelphis domestica 19 monDom5
#> 11 9258 Ornithorhynchus anatinus 17 OANA5
#> 12 7240 Drosophila simulans 17 GCA_000259055.1
#> 13 9598 Pan troglodytes 15 CHIMP2.1.4
#> 14 7237 Drosophila pseudoobscura 14 GCA_000001765.2
#> 15 7227 Drosophila melanogaster 14 BDGP6
#> 16 9593 Gorilla gorilla 13 gorGor3.1
#> 17 9597 Pan paniscus 12 CHIMP2.1.4
#> 18 9823 Sus scrofa 10 Sscrofa10.2
#> 19 10141 Cavia porcellus 9 Felis_catus_6.2
#> 20 9685 Felis catus 9 cavPor3
#> 21 7230 Drosophila mojavensis 8 EquCab2
#> 22 9796 Equus caballus 8 GCA_000005175.1
#> 23 9986 Oryctolagus cuniculus 6 eriEur1
#> 24 9615 Canis lupus familiaris 6 CanFam3.1
#> 25 9365 Erinaceus europaeus 6 OryCun2.0
#> 26 7244 Drosophila virilis 4 GCA_000005245.1
#> 27 28377 Anolis carolinensis 4 AnoCar2.0
#> 28 7217 Drosophila ananassae 4 GCA_000005975.1
#> 29 7245 Drosophila yakuba 4 GCA_000005115.1
If you want to use BgeeCall on a species for which Bgee does not provide reference intergenic sequences you have the possibility to create them by yourself and share them with the Bgee community by following all steps of this tutorial. Do not forget that the number of RNA-Seq libraries is a key point to the generation of precise reference intergenic sequences. It is possible to list in R all species for which reference intergenic sequences have been created by the community using the following code
list_community_ref_intergenic_species()
#> speciesId numberOfLibraries annotationVersion genomeVersion kallistoVersion
#> 1 10036 15 MesAur1.0 MesAur1.0 0.46.0
#> 2 13686 243 Si_gnG Si_gnG 0.44.0
If reference intergenic sequences of the species you are interested in are available only from the community release it is then possible to use this release to generate your present/absent calls
If you generated your own reference intergenic sequences follwowing this tuorial but did not share them for the moment (do not forget to do it…), it is also possible to use BgeeCall with a file containing the sequences. In this case you need to select the custom release and provide the path to the file containing reference intergenic sequences :
kallisto generates TPMs at the transcript level. In the Bgee pipeline
we summarize this expression at the gene level to calculate our
present/absent calls. In BgeeCall
it is now possible to
generate present/absent calls at the transcript level. Be careful when
using this feature as it has not been tested for the moment. To generate
such calls you only have to create one object of the class
KallistoMetadata
and edit the value of one attribute
By default BgeeCall
suppose that kallisto is installed.
If kallisto is not installed on your computer you can either :
kallisto <- new("KallistoMetadata", download_kallisto = TRUE)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)
By default kallisto is run with the same parameters that we use in the RNA-Seq Bgee pipeline:
It is possible to modify them and use your favourite kallisto parameters
By default 2 indexes with 2 different kmer sizes can be used by
BgeeCall
The default kmer size of kallisto (31) is used for
libraries with reads length equal or larger than 50 bp. A kmer size of
15 is used for libraries with reads length smaller than 50 bp. We
decided not to allow to tune kmers size because the generation of the
index is time consuming and index generation takes even more time with
small kmers size (< 15bp). However it is possible to modify the
threshold of read length allowing to choose between default and small
kmer size.
# libraries with reads smaller than 70bp will use the index with kmer size = 15
kallisto <- new("KallistoMetadata", read_size_kmer_threshold = 70)
calls_output <- generate_calls_workflow(myAbundanceMetadata = kallisto, userMetadata = user_BgeeCall)
Note that for libraries with read length unspecified the default kmer size(31) will be used.
By default gene expression calls are generated using all runs of the RNA-Seq library. It is possible to select only a subset of these runs.
# RNA-Seq run SRR350955_subsetof from the RNA-Seq library will be used to generate the calls
user_BgeeCall <- setRunIds(user_BgeeCall, c("SRR350955_subset"))
calls_output <- run_from_object(myUserMetadata = user_BgeeCall)
When run IDs are selected, the name output directory combine the
library ID and all selected run IDs. In our example the expression calls
will be stored in the directory
SRX099901_SRR350955_subset
.
By default BgeeCall generate calls using a pValue approach. In order to generate the pValue per gene id we compute the numerical measure, Z-score, that describes the value relationship to the mean. The Z-score measure is done in terms of standards deviations from the referent mean, in this case using a set of reference intergenic regions, as specified in the formula below:
$$ zScore = \frac {log2(tpmValue) - mean(log2(IntergenicTpmValues))}{sd(log2(IntergenicTpmValues))} $$ From the Z-score value, for each gene id, we compute a distribution of pValues using the following formula:
pValue = pnorm(zscore, lower.tail = FALSE) Genes are consider present if the pValue is lower or equal to 0.05.
By default all genes with an abundance higher than 0 (i.e having at least one read mapped to a transcript) and with a pValue lower or equal to 0.05 are considered as present. Other genes are called absent. It is possible to modify the pValue cutoff. Be careful when editing this value as it has a big impact on your present absent.
Expression calls can also be generated using a threshold of intergenic sequences called present. This approach was used to generate Bgee expression calls until release 14 with the formula :
$$ \frac {proportion\ of\ reference\ intergenic\ present}{proportion\ of\ protein\ coding\ present} = 0.05 $$ It is possible to change the cutoff value of this ratio.
The expression calls can also be generated using the qValue approach. In this approach we perform a linear interpolation for each density region, genic and reference intergenic, followed by the numerical integration. Then for each unique abundance value (TPM) we integrate and scale across the region. After that, the correspondent qValue for each gene id is computed following the formula:
$$ qValue = \frac {intergenic}{intergenic + genic} $$ ### Collect statistics
In BgeeCall the user are able to collect all the statistics provided by the different approaches to call expressed genes at individual library. To provide this informative table, all cutoff_info_file_ from each individual library and from each individual approach are used to retrieve the correspondent information.
By default BgeeCall write output messages for all parts of the workflow. It is possible not to write any messages by changing the value of the slot verbose in the UserMetadata object. By default this value is set to true but it is possible to change it to false with this line :
Generation of present/absent expression calls is done in several steps. It is possible to force overwritting already existing intermediary files : - overwrite_index : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the index generation step is skiped if one index already exists. If TRUE, the kallisto index will be generated even if one index already exists. - overwrite_quant : slot of the object KallistoMetadata. The value has to be a logical. If FALSE (default), the kallisto quantification step is skiped if a quantification file already exists. If TRUE, the kallisto quantification step will be run even if a quantification file already exists. - overwrite_calls : slot of the object KallistoMetadata. The value has to be a logical. If FALSE, the generation of present/absent calls is skiped if an index already exists. If TRUE (default), the generation of present/absent calls will be run even if calls were already generated.
It can happen that calls are generated using transcriptome or annotations containing transcript version (number after a dot in transcript IDs e.g ENSMUST00000082908.2) and annotations or transcriptome without transcript version. This is problematic when using tximport to transform abundance at transcript level to abundance at gene level and result in an error.
Error in .local(object, ...) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
Example IDs (file): [, ...]
Example IDs (tx2gene): [ENSMUST00000193812, ENSMUST00000082908, ENSMUST00000192857, ...]
This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.
Calls: do.call ... tximport -> summarizeToGene -> summarizeToGene -> .local
Execution halted
To solve this error tximport implemented an option called ignoreTxVersion that remove the transcript version from both transcriptome and annotations. It is possible to use this option by modifying the value of the slot ignoreTxVersion (default FALSE) of the S4 class KallistoMetadata.
By default the arborescence of directories created by
BgeeCall
is as simple as possible. the results will be
created using the path
working_path/intergenic_release/all_results/libraryId
.
Generating present/absent gene expression calls for the same RNA-Seq
library using different transcriptome or annotation versions using this
arborescence will overwrite previous results. The
UserMetadata
class has an attribute
simple_arborescence
that is TRUE
by default.
If FALSE
, a complexe arborescence of directories containing
the name of the annotation and transcriptome files will be created. This
complex arborescence will then allow to generate present/absent calls
for the same library using different version of transcriptome or
annotaiton.
By default directories used to save present/absent calls are
subdirectories of UserMetadata@working_path
. However it is
possible to select the directory where you want the calls to be
generated.
This output directory will only contains results generated at the
RNA-Seq library level. All data generated at species level are still
stored using the UserMetadata@working_path
. They can then
still be reused to generate calls from other libraries of the same
species.
Two functions are available to run BgeeCall on a slurm queuing system. Parameters described below are available for both of them.
The full idea of using a cluster is to parallelize your jobs. By
default 10 jobs are run at the same time. It is possible to modify this
number with the parameter nodes
.
In order to be able to check files automatically generated to run the jobs it is possible to generate these files without submiting your jobs. More information on created files are available on the vignette of the (https://cran.r-project.org/web/packages/rslurm/vignettes/rslurm.html)[rslurm package].
A bash scirpt is automatically created to run the jobs. This script contains default slurm options (array, cpus-per-task, job-name, output). All other slurm options recognized by the sbatch command can be updated b creating a named list where name correspond to long name of options (e.g do not use ‘p’ but ‘partition’).
In some cluster programs are not loaded by default. The modules parameter allows to load them by adding one line in the sbatch script. This option has been implemented to add modules but could potentially be used to add any custom line of code in the sbatch script.
By default except for columns present in the tsv file all other slots of the 3 BgeeCall classes will use default values. In order to tune these parameters it is possible to create the objects and pass them to the slurm functions. When generating these objects it is mandatory to keep the same name as in the example below.
# create BgeeCall objects and use them to generate indexes
kallistoMetadata <- new("KallistoMetadata", download_kallisto=TRUE)
userMetadata <- new("UserMetadata", working_path = "/path/to/working/dir")
bgeeMetadata <- new("BgeeMetadata", intergenic_release = "0.1")
generate_slurm_indexes(userFile = "path/to/file.tsv", kallistoMetadata = kallistoMetadata, bgeeMetadata = bgeeMetadata, userMetadata = userMetadata)
In the BgeeCall package we have the possibility to perform calls of expression by merging/combine multiple libraries that belongs to a particular condition. In order to use this functionality, the calls at individual library need to be done by using the p-value or q-value approach, since this approaches provide for each genei a quantitative metric.
If the calls of expression genes, at individual library, was done using the pValue approach, then the merging will be done by using the Benjamini & Hochberg method.
For this method a set of pValues are collected, for each genei across n libraries, and then the referent p.adjusted values are computed. To call expressed genes in the merging, we verify if one of the p.adjusted values is lower then the cutoff (threshold value provided by the user), if yes, the genei is classified as present for the condition, otherwise is absent.
mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")
On the other side, if the approach used to call expressed genes, at individual library, was the qValue, then the merging will be done by using the inverse fdr correction, this means by applying the correspondent formula:
$$ FDR_{inverse} = (1-((1-q)^{(\frac {1}{n})})) $$ where q is the desired cut-off (threshold value provided by the user) and n is the number of libraries.
# merging libraries
mergingLibraries <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = "species_id", cutoff = 0.05, outDir = "path/to/output_directory/")
To call expressed genes in the merging condition, we verify if one of the libraries have a q-Value lower then the FDR_inverse, if yes, the genei is classified as present, otherwise is classified as absent.
In the merging_libraries() function, different parameters need to be passed to the referent arguments: userFile, approach, condition, cutoff or outDir.
In the userFile argument a path to a text file containing all information about the target libraries to be merged. This file should contain a fundamental column called: specied_id in order to run the merging_libraries() function. This column should not be empty and also an important request is that you should have at least 2 libraries to be processed, since this is a merging process.
If you want to merge/combine libraries for a more detailed condition you need to add the correspondent extra columns to your user file. Note that the argument condition recognize the following parameters: specied_id, anatEntity, devStage, sex and strain. This means that your table columns also should have the same ID. The order of the columns are not relevant when the user file is structured, as well as, the order of the parameters passed to the condition argument.
Is always recommended to use “-” instead of empty rows when some information is missing, as represented in the BgeeCall/inst/userMetadataTemplate_merging.tsv
The argument approach allow to specify the method that will be used to make the calls of expression in the merging libraries, and as described before, the condition argument will allow to specify the particular condition of interest, the condition argument allow to target the correspondent libraries where the merging will occur.
More restricted cut-off can be applied to call expressed genes in the merging libraries for a target condition, for this the cutoff argument can be modified, otherwise by default is 0.05.
# merging libraries where p-values were calculated
mergingLibraries_pValue <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "BH", condition = c("species_id", "anatEntity", "strain"), cutoff = 0.005, outDir = "path/to/output_directory/")
# merging libraries where q-values were calculated
mergingLibraries_qValue <- merging_libraries(userFile = "path/to/userFile.tsv", approach = "fdr_inverse", condition = c("species_id", "devStage" ,"anatEntity", "strain"), cutoff = 0.001, outDir = "path/to/output_directory/")
#Session Info
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] rtracklayer_1.67.0 GenomicRanges_1.59.1 GenomeInfoDb_1.43.2
#> [4] IRanges_2.41.2 S4Vectors_0.45.2 AnnotationHub_3.15.0
#> [7] BiocFileCache_2.15.0 dbplyr_2.5.0 BiocGenerics_0.53.3
#> [10] generics_0.1.3 BgeeCall_1.23.0 rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] sjlabelled_1.2.0 tidyselect_1.2.1
#> [3] dplyr_1.1.4 blob_1.2.4
#> [5] filelock_1.0.3 Biostrings_2.75.3
#> [7] bitops_1.0-9 fastmap_1.2.0
#> [9] RCurl_1.98-1.16 GenomicAlignments_1.43.0
#> [11] XML_3.99-0.17 digest_0.6.37
#> [13] mime_0.12 lifecycle_1.0.4
#> [15] KEGGREST_1.47.0 RSQLite_2.3.9
#> [17] magrittr_2.0.3 compiler_4.4.2
#> [19] rlang_1.1.4 sass_0.4.9
#> [21] tools_4.4.2 yaml_2.3.10
#> [23] data.table_1.16.4 knitr_1.49
#> [25] S4Arrays_1.7.1 bit_4.5.0.1
#> [27] curl_6.0.1 DelayedArray_0.33.3
#> [29] abind_1.4-8 BiocParallel_1.41.0
#> [31] withr_3.0.2 purrr_1.0.2
#> [33] sys_3.4.3 grid_4.4.2
#> [35] rslurm_0.6.2 Rhdf5lib_1.29.0
#> [37] insight_1.0.0 SummarizedExperiment_1.37.0
#> [39] cli_3.6.3 crayon_1.5.3
#> [41] httr_1.4.7 rjson_0.2.23
#> [43] DBI_1.2.3 cachem_1.1.0
#> [45] rhdf5_2.51.1 zlibbioc_1.52.0
#> [47] parallel_4.4.2 AnnotationDbi_1.69.0
#> [49] BiocManager_1.30.25 XVector_0.47.1
#> [51] restfulr_0.0.15 matrixStats_1.4.1
#> [53] vctrs_0.6.5 Matrix_1.7-1
#> [55] jsonlite_1.8.9 bit64_4.5.2
#> [57] GenomicFeatures_1.59.1 maketools_1.3.1
#> [59] jquerylib_0.1.4 glue_1.8.0
#> [61] codetools_0.2-20 BiocVersion_3.21.1
#> [63] BiocIO_1.17.1 UCSC.utils_1.3.0
#> [65] tibble_3.2.1 pillar_1.10.0
#> [67] rappdirs_0.3.3 htmltools_0.5.8.1
#> [69] rhdf5filters_1.19.0 GenomeInfoDbData_1.2.13
#> [71] R6_2.5.1 tximport_1.35.0
#> [73] evaluate_1.0.1 lattice_0.22-6
#> [75] Biobase_2.67.0 png_0.1-8
#> [77] Rsamtools_2.23.1 memoise_2.0.1
#> [79] bslib_0.8.0 SparseArray_1.7.2
#> [81] xfun_0.49 MatrixGenerics_1.19.0
#> [83] buildtools_1.0.0 sjmisc_2.8.10
#> [85] pkgconfig_2.0.3