The screenCounter package implements several functions to quantify the frequency of barcodes in a sequencing screen experiment. These functions on the raw FASTQ files, yielding a matrix of counts for each barcode in each sample for downstream analysis. Quantification can be performed for both single and combinatorial barcodes, though only single-end sequencing data are currently supported.
The “barcode” is considered to be the entire sequence structure that is used to label a particular molecular or biological entity. This includes regions of variable sequence that distinguish one barcode from another as well as constant sequences that flank or separate the variable regions. The most common barcode design uses only one variable region, which we refer to as a “single barcode”.
# Example of a single barcode:
CAGCAGCATGCTGNNNNNNNNNNNNNNCAGCATCGTGC
-------------^^^^^^^^^^^^^^-----------
constant variable constant
flank region flank
To demonstrate, let’s mock up a FASTQ file from a single-end sequencing screen. For simplicity, we will assume that each read only contains the barcode. However, it is entirely permissible to have additional (unknown) flanking sequences in the read.
# Our pool of known variable sequences, one per barcode:
known <- c("AAAAAAAA", "CCCCCCCC", "GGGGGGGG", "TTTTTTTT")
# Mocking up some sequence data, where each read randomly contains
# one of the variable sequences, flanked by constant regions.
library(Biostrings)
chosen <- sample(known, 1000, replace=TRUE)
reads <- sprintf("GTAC%sCATG", chosen)
names(reads) <- sprintf("READ_%i", seq_along(reads))
# Writing to a FASTQ file.
single.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=single.fq, format="fastq")
We quantify single barcodes across one or more files using the
matrixOfSingleBarcodes()
function. This produces a
SummarizedExperiment
containing a count matrix of
frequencies of each barcode (row) in each file (column). The order of
rows corresponds to the order of variable regions in known
that distinguishes each barcode.
library(screenCounter)
out <- matrixOfSingleBarcodes(single.fq,
flank5="GTAC", flank3="CATG", choices=known)
out
## class: SummarizedExperiment
## dim: 4 1
## metadata(0):
## assays(1): counts
## rownames(4): AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT
## rowData names(1): choices
## colnames(1): file121b1e2762eb.fastq
## colData names(3): paths nreads nmapped
## file121b1e2762eb.fastq
## AAAAAAAA 237
## CCCCCCCC 250
## GGGGGGGG 242
## TTTTTTTT 271
We specify the constant sequences that immediately flank the variable region1. This improves stringency of barcode identification as the barcode will not be recognized in the read sequence unless the constant sequences match perfectly. It also improves speed as the matching of the variable sequence is only performed at positions along the read where the flanking constant sequences have matched.
Obviously, users should not supply the full length of the flanking
sequence if that sequence is not captured in the read. For example, the
3’ flanking sequence may be lost in experiments using short reads that
only reach to the end of the variable region. In such cases, an empty
string should be specified as flank3=
.
A more complex experimental design involves combinatorial barcodes where multiple variable regions are randomly ligated to form a single sequence. This exploits combinatorial complexity to generate a large number of unique barcodes from a limited pool of known variable sequences.
# Example of a combinatorial barcode:
CAGCTANNNNNNNNNNCACGNNNNNNNNNNCAGCT
------^^^^^^^^^^----^^^^^^^^^^-----
variable variable
To demonstrate, let’s mock up another FASTQ file of single-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.
# Our pool of known variable sequences:
known1 <- c("AAAA", "CCCC", "GGGG", "TTTT")
known2 <- c("ATTA", "CGGC", "GCCG", "TAAT")
# Mocking up some sequence data, where each read randomly contains
# two of the variable sequences within a template structure.
library(Biostrings)
chosen1 <- sample(known1, 1000, replace=TRUE)
chosen2 <- sample(known2, 1000, replace=TRUE)
reads <- sprintf("GTAC%sCATG%sGTAC", chosen1, chosen2)
names(reads) <- sprintf("READ_%i", seq_along(reads))
# Writing to a FASTQ file.
combo.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(reads), file=combo.fq, format="fastq")
We quantify combinatorial barcodes across one or more files using the
matrixOfComboBarcodes()
function. This requires a template
for the barcode structure to specify how the variable sequences are used
to construct the final barcode. It is permissible for each variable
sequence to be sampled from a different known pool.
out <- matrixOfComboBarcodes(combo.fq,
template="GTACNNNNCATGNNNNGTAC",
choices=list(first=known1, second=known2))
out
## class: SummarizedExperiment
## dim: 16 1
## metadata(0):
## assays(1): counts
## rownames(16): BARCODE_1 BARCODE_2 ... BARCODE_15 BARCODE_16
## rowData names(2): first second
## colnames(1): file121b696dff67.fastq
## colData names(3): paths nreads nmapped
This function yields a SummarizedExperiment
object
containing a matrix of frequencies of each barcode (row) in each file
(column).
## file121b696dff67.fastq
## BARCODE_1 67
## BARCODE_2 59
## BARCODE_3 59
## BARCODE_4 58
## BARCODE_5 56
## BARCODE_6 55
## BARCODE_7 67
## BARCODE_8 78
## BARCODE_9 57
## BARCODE_10 70
## BARCODE_11 65
## BARCODE_12 62
## BARCODE_13 53
## BARCODE_14 64
## BARCODE_15 68
## BARCODE_16 62
The identities of the variable regions in each combinatorial barcode
are specified in the rowData
, which contains the variable
sequences in known1
and known2
that define
each barcode.
## DataFrame with 16 rows and 2 columns
## first second
## <character> <character>
## BARCODE_1 AAAA ATTA
## BARCODE_2 AAAA CGGC
## BARCODE_3 AAAA GCCG
## BARCODE_4 AAAA TAAT
## BARCODE_5 CCCC ATTA
## ... ... ...
## BARCODE_12 GGGG TAAT
## BARCODE_13 TTTT ATTA
## BARCODE_14 TTTT CGGC
## BARCODE_15 TTTT GCCG
## BARCODE_16 TTTT TAAT
Another experimental design involves the use of dual barcodes, often to modulate the expression of two genes at once. The aim is to, again, count the frequency of each combination of barcodes among the observed read pairs. This differs from the combinatorial barcodes in that the dual barcodes are present on paired reads, with one barcode per read; in addition, not all combinations of two barcodes may be valid.
# Example of a dual barcode:
# READ 1:
CAGCTANNNNNNNNNNCACG
------^^^^^^^^^^----
variable
# READ 2:
CACGGTTNNNNNNNNNNCAGC
-------^^^^^^^^^^----
variable
To demonstrate, let’s mock up some FASTQ files of paired-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.
# Creating an example dual barcode sequencing experiment.
known.pool1 <- c("AGAGAGAGA", "CTCTCTCTC",
"GTGTGTGTG", "CACACACAC")
known.pool2 <- c("ATATATATA", "CGCGCGCGC",
"GAGAGAGAG", "CTCTCTCTC")
# Mocking up the barcode sequences.
N <- 1000
read1 <- sprintf("CAGCTACGTACG%sCCAGCTCGATCG",
sample(known.pool1, N, replace=TRUE))
names(read1) <- seq_len(N)
read2 <- sprintf("TGGGCAGCGACA%sACACGAGGGTAT",
sample(known.pool2, N, replace=TRUE))
names(read2) <- seq_len(N)
# Writing them to FASTQ files.
tmp <- tempfile()
tmp1 <- paste0(tmp, "_1.fastq")
writeXStringSet(DNAStringSet(read1), filepath=tmp1, format="fastq")
tmp2 <- paste0(tmp, "_2.fastq")
writeXStringSet(DNAStringSet(read2), filepath=tmp2, format="fastq")
Let us imagine that only a subset of the barcode combinations are actually valid. This is often the case because, in these studies, the combinations are explicitly designed to refer to known phenomena (e.g., gene combinations) rather than being randomized as described for combinatorial barcodes.
choices <- expand.grid(known.pool1, known.pool2)
choices <- DataFrame(barcode1=choices[,1], barcode2=choices[,2])
choices <- choices[sample(nrow(choices), nrow(choices)*0.9),]
We quantify dual barcodes across one or more pairs of files
using the matrixOfDualBarcodes()
function. This requires
templates for the barcode structure to specify how the variable
sequences are used to construct each final barcode; the first template
corresponds to the first barcode, while the second template corresponds
to the second barcode. Many of the options available for single barcode
matching in matrixOfSingleBarcodes()
are also available
here and can be specificed separately for each barcode.
out <- matrixOfDualBarcodes(list(c(tmp1, tmp2)),
choices=choices,
template=c("CAGCTACGTACGNNNNNNNNNCCAGCTCGATCG",
"TGGGCAGCGACANNNNNNNNNACACGAGGGTAT"))
out
## class: SummarizedExperiment
## dim: 14 1
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): barcode1 barcode2
## colnames(1): file121b33bad58e_1.fastq
## colData names(3): paths1 paths2 npairs
This function yields a SummarizedExperiment
object
containing a matrix of frequencies of each barcode pair (row) in each
pair of files (column).
## file121b33bad58e_1.fastq
## [1,] 76
## [2,] 52
## [3,] 60
## [4,] 53
## [5,] 64
## [6,] 56
## [7,] 52
## [8,] 66
## [9,] 60
## [10,] 60
## [11,] 72
## [12,] 63
## [13,] 73
## [14,] 66
Further diagnostics about the mapping are provided in the
colData
:
## DataFrame with 1 row and 3 columns
## paths1 paths2
## <character> <character>
## file121b33bad58e_1.fastq /tmp/RtmplM9VpH/file.. /tmp/RtmplM9VpH/file..
## npairs
## <integer>
## file121b33bad58e_1.fastq 1000
The identities of the variable regions corresponding to each row are
specified in the rowData
, which contains the variable
sequences in barcode1
and barcode1
that define
each barcode combination.
## DataFrame with 14 rows and 2 columns
## barcode1 barcode2
## <factor> <factor>
## 1 AGAGAGAGA CGCGCGCGC
## 2 CACACACAC ATATATATA
## 3 GTGTGTGTG CTCTCTCTC
## 4 CTCTCTCTC ATATATATA
## 5 CTCTCTCTC GAGAGAGAG
## ... ... ...
## 10 CTCTCTCTC CGCGCGCGC
## 11 GTGTGTGTG GAGAGAGAG
## 12 AGAGAGAGA ATATATATA
## 13 AGAGAGAGA CTCTCTCTC
## 14 GTGTGTGTG CGCGCGCGC
By default, we assume that the read sequence in the first FASTQ file
contains the first barcode (i.e., choices[,1]
) and the
second FASTQ file contains the second barcode. If the arrangement is
randomized, we can set randomized=TRUE
to search the first
FASTQ file for the second barcode and vice versa. Note that this will
filter out read pairs that match different valid combinations in both
arrangements.
A variation of the dual barcode design involves both variable regions being on a single read.
# Example of a dual barcode:
CAGCTANNNNNNNNNNCACGCACGGTTNNNNNNNNNNCAGC
------^^^^^^^^^^-----------^^^^^^^^^^----
variable variable
To demonstrate, let’s mock up another FASTQ file of single-end read data. Again, for simplicity, we will assume that each read sequence contains only the barcode.
# Creating an example dual barcode sequencing experiment.
known.pool1 <- c("AGAGAGAGA", "CTCTCTCTC",
"GTGTGTGTG", "CACACACAC")
known.pool2 <- c("ATATATATA", "CGCGCGCGC",
"GAGAGAGAG", "CTCTCTCTC")
# Mocking up the barcode sequences.
N <- 1000
read <- sprintf("CAGCTACGTACG%sCCAGCTCGATCG%sACACGAGGGTAT",
sample(known.pool1, N, replace=TRUE),
sample(known.pool2, N, replace=TRUE))
names(read) <- seq_len(N)
# Writing them to FASTQ files.
tmp <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(read), filepath=tmp, format="fastq")
Let us imagine that only a subset of the barcode combinations are actually valid:
choices <- expand.grid(known.pool1, known.pool2)
choices <- DataFrame(barcode1=choices[,1], barcode2=choices[,2])
choices <- choices[sample(nrow(choices), nrow(choices)*0.9),]
We quantify single-end dual barcodes across files using the
matrixOfDualBarcodesSingleEnd()
function. Many of the
options available for single barcode matching in
matrixOfSingleBarcodes()
are also available here and can be
specificed separately for each barcode.
out <- matrixOfDualBarcodesSingleEnd(tmp, choices=choices,
template="CAGCTACGTACGNNNNNNNNNCCAGCTCGATCGNNNNNNNNNACACGAGGGTAT")
out
## class: SummarizedExperiment
## dim: 14 1
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): barcode1 barcode2
## colnames(1): file121b1675b882.fastq
## colData names(2): paths nreads
This function yields a SummarizedExperiment
object
containing a matrix of frequencies of each valid barcode combination
(row) in each file (column).
## file121b1675b882.fastq
## [1,] 64
## [2,] 62
## [3,] 66
## [4,] 59
## [5,] 66
## [6,] 73
## [7,] 66
## [8,] 67
## [9,] 55
## [10,] 58
## [11,] 56
## [12,] 70
## [13,] 47
## [14,] 62
Further diagnostics about the mapping are provided in the
colData
:
## DataFrame with 1 row and 2 columns
## paths nreads
## <character> <integer>
## file121b1675b882.fastq /tmp/RtmplM9VpH/file.. 1000
The identities of the variable regions corresponding to each row are
specified in the rowData
, which contains the variable
sequences in barcode1
and barcode1
that define
each barcode combination.
## DataFrame with 14 rows and 2 columns
## barcode1 barcode2
## <factor> <factor>
## 1 CTCTCTCTC ATATATATA
## 2 CTCTCTCTC CGCGCGCGC
## 3 AGAGAGAGA CGCGCGCGC
## 4 CTCTCTCTC GAGAGAGAG
## 5 GTGTGTGTG ATATATATA
## ... ... ...
## 10 AGAGAGAGA ATATATATA
## 11 GTGTGTGTG GAGAGAGAG
## 12 CACACACAC CGCGCGCGC
## 13 CACACACAC GAGAGAGAG
## 14 CTCTCTCTC CTCTCTCTC
In this barcode design, the variable region is not sampled from a pool of known sequences but is instead randomly synthesized from a nucleotide mixture. We cannot use any of the previous functions as they need a known pool; rather, we need to extract the observed sequences (and their frequencies) from the reads. To demonstrate, let’s mock up a FASTQ file from a single-end sequencing screen with a single random variable region.
# Mocking up a 8-bp random variable region.
N <- 1000
randomized <- lapply(1:N, function(i) {
paste(sample(c("A", "C", "G", "T"), 8, replace=TRUE), collapse="")
})
barcodes <- sprintf("CCCAGT%sGGGATAC", randomized)
names(barcodes) <- sprintf("READ_%i", seq_along(barcodes))
# Writing to a FASTQ file.
single.fq <- tempfile(fileext=".fastq")
writeXStringSet(DNAStringSet(barcodes), file=single.fq, format="fastq")
For this design, we can count the frequency of each observed barcode
using the matrixOfRandomBarcodes()
function. This produces
a SummarizedExperiment
containing a count matrix of
frequencies for each barcode.
library(screenCounter)
out <- matrixOfRandomBarcodes(single.fq,
template="CCCAGTNNNNNNNNGGGATAC")
out
## class: SummarizedExperiment
## dim: 994 1
## metadata(0):
## assays(1): counts
## rownames(994): AAAAAAGT AAAAATCG ... TTTTCTCT TTTTTGCG
## rowData names(1): sequences
## colnames(1): file121b26d43cd0.fastq
## colData names(3): paths nreads nmapped
## file121b26d43cd0.fastq
## AAAAAAGT 1
## AAAAATCG 1
## AAAACGCA 1
## AAAACTAC 1
## AAAAGCGA 1
## AAAAGGTT 1
Mismatch tolerance can be enabled by setting
substitutions
to the desired number of mismatches. This may
improve barcode recovery in the presence of sequencing errors. While
such errors are rare per base2, they will be more likely to occur when
considering the entire length of the barcode.
In practice, mismatch tolerance is turned off by default in all counting functions. This is because many errors are introduced during barcode synthesis, more than those due to library preparation or sequencing. Synthesis errors result in genuinely different barcodes (e.g., guides targeting different regions, or different molecular tags) that should not be counted together. Nonetheless, this setting may be useful for debugging experiments with low mapping rates.
Indels are not tolerated in any matches.
By default, the counting functions will search the read sequence on
both the original strand reported by the sequencer and on the reverse
complement. It is possible to change this behaviour with the
strand=
argument to only search one of the strands. The
most appropriate setting requires knowledge of the sequencing protocol
and the barcode design. For example:
strand="original"
if the barcode is on the forward strand
of the reachable end, or strand="reverse"
if it is on the
reverse strand. If the sequencing is performed from a randomly chosen
end, only 50% of the reads will contain barcodes.strand="both"
to search both strands of each read. This
avoids loss of 50% of reads provided that the constant regions are
captured by reads from both ends.Paired-end sequencing of single or combinatorial barcodes requires some more consideration:
strand="original"
or strand="reverse"
.strand="original"
or strand="reverse"
and sum
the counts. (We assume that the chance of randomly encountering the
barcode sequence and its flanking constant regions is negligible, so
each read pair will not contribute more than once to the final
count.)strand="both"
. Alternatively, users can
consolidate read pairs into a consensus read3 prior to using
strand="both"
.For paired-end sequencing of dual barcodes, each paired barcode is
covered by exactly one paired read in the same direction. This means
that we should never have to set strand="both"
as the
strand orientation should be known ahead of time. The most that should
need to be done is to set strand
separately for each
barcode, which is achieved by passing a vector of length 2 to specify
the orientation for the first and second barcodes, respectively; or to
set randomized=TRUE
as described above.
Counting functions are parallelized across files using the BiocParallel framework. Each file is processed separately on a single worker to create the final matrix of counts.
# Pretend that these are different samples:
all.files <- c(single.fq, single.fq, single.fq)
# Parallel execution:
library(BiocParallel)
out <- matrixOfSingleBarcodes(all.files,
flank5="GTAC", flank3="CATG", choices=known,
BPPARAM=SnowParam(2))
out
## class: SummarizedExperiment
## dim: 4 3
## metadata(0):
## assays(1): counts
## rownames(4): AAAAAAAA CCCCCCCC GGGGGGGG TTTTTTTT
## rowData names(1): choices
## colnames(3): file121b26d43cd0.fastq file121b26d43cd0.fastq
## file121b26d43cd0.fastq
## colData names(3): paths nreads nmapped
Users should read the relevant
documentation to determine the most effective parallelization
approach for their system. For example, users working on a cluster
should use BatchToolsParam()
, while users on a local
non-Windows machines may prefer to use
MulticoreParam()
.
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] BiocParallel_1.41.0 screenCounter_1.7.1
## [3] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [5] GenomicRanges_1.59.1 MatrixGenerics_1.19.0
## [7] matrixStats_1.4.1 Biostrings_2.75.2
## [9] GenomeInfoDb_1.43.2 XVector_0.47.0
## [11] IRanges_2.41.2 S4Vectors_0.45.2
## [13] BiocGenerics_0.53.3 generics_0.1.3
## [15] BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-1 jsonlite_1.8.9 compiler_4.4.2
## [4] BiocManager_1.30.25 crayon_1.5.3 Rcpp_1.0.13-1
## [7] parallel_4.4.2 jquerylib_0.1.4 yaml_2.3.10
## [10] fastmap_1.2.0 lattice_0.22-6 R6_2.5.1
## [13] S4Arrays_1.7.1 knitr_1.49 DelayedArray_0.33.3
## [16] snow_0.4-4 maketools_1.3.1 GenomeInfoDbData_1.2.13
## [19] bslib_0.8.0 rlang_1.1.4 cachem_1.1.0
## [22] xfun_0.49 sass_0.4.9 sys_3.4.3
## [25] SparseArray_1.7.2 cli_3.6.3 zlibbioc_1.52.0
## [28] grid_4.4.2 digest_0.6.37 lifecycle_1.0.4
## [31] evaluate_1.0.1 codetools_0.2-20 buildtools_1.0.0
## [34] abind_1.4-8 rmarkdown_2.29 httr_1.4.7
## [37] tools_4.4.2 htmltools_0.5.8.1 UCSC.utils_1.3.0