Pooled CRISPR perturbations screens employ a library of guide RNAs (gRNAs) that is transduced into a pool of cells with the aim to induce a single genetic perturbation in each cell. The perturbation effect is assessed by measuring the abundance of each gRNA after the screen selection phase and comparing it to its abundance in the plasmid library. The main goal of the following analysis is the detection of essential genes, i.e. genes whose knockout reduces the cell fitness. The package gscreend provides a method to rank genes based on count tables.
In order to identify essential genes starting from raw gRNA count data, gscreend performs the following analysis steps:
Input of raw gRNA counts at T0 (sequencing of library) and T1 (at the end of the screen). Normalization and calculation of log fold changes.
Split log fold changes into intervals dependent on the initial count at T0.
For every interval fit a skew-normal distribution to the data to model the null hypothesis (via least quantile regression).
Based on the null model calculate p-values for every gRNA.
Rank gRNAs according to p-value and perform robust ranking aggregation to calculate p-values on gene level.
Perform quality control of data and statistical model.
The simulated data used in this example has been generated using the simulation method available at https://github.com/imkeller/simulate_pooled_screen
Raw count data consists of gRNA counts in the library sequencing and different replicates after the screen proliferation phase. In order to estimate the effect of a specific gRNA on cell fitness, the relative abundances of the gRNA before and after the proliferation phase will be compares
raw_counts <- read.table(
system.file("extdata", "simulated_counts.txt", package = "gscreend"),
header=TRUE)
Generate a summarized experiment from the count data. gscreend currently uses SummarizedExperiment objects as an input format.
The count matrix contains raw gRNA counts. Each row represents one gRNA, each column represents one sample (T0, T1, replicates, …).
counts_matrix <- cbind(raw_counts$library0,
raw_counts$R0_0,
raw_counts$R1_0)
rowData <- data.frame(sgRNA_id = raw_counts$sgrna_id,
gene = raw_counts$Gene)
colData <- data.frame(samplename = c("library", "R1", "R2"),
# timepoint naming convention:
# T0 -> reference,
# T1 -> after proliferation
timepoint = c("T0", "T1", "T1"))
se <- SummarizedExperiment(assays=list(counts=counts_matrix),
rowData=rowData, colData=colData)
In this step a gscreend experiment object is generated that will after the analysis contain all data related to gRNAs, genes and model parameters.
## Creating PoolScreenExp object from a SummarizedExperiment object.
## References and samples are named correctly.
## Data concerning sgRNA and genes is provided.
Run gscreend with default parameters.
## Size normalized count data.
## Calculated LFC.
## Fitted null distribution.
## Calculated p-values at gRNA level.
## Ranking genes...
## ... for positive fold changes
## ... for negative fold changes
## gscreend analysis has been completed.
gscreend provides basic quality control functions for inspection of replicate correlation for example.
The plotModelParameters()
function can be used to
inspect the values of the parameters estimated for the skew normal
distribution of the logarithmic fold change data.
The ResultsTable function can be used to extract a table listing for each gene the p-value and fdr. These values correspond to the results from the statistical test indicting whether upon perturbation a specific gene reduces (direction = “negative”), or increasing (direction = “positive”) cell viability.
## Name fdr pval lfc
## essential_0 essential_0 0.0003063725 0.00002 -1.421372
## essential_1 essential_1 0.0000000000 0.00000 -1.984253
## essential_10 essential_10 0.0003063725 0.00002 -1.605037
## essential_100 essential_100 0.0000000000 0.00000 -3.075151
## essential_1000 essential_1000 0.0000000000 0.00000 -2.023614
## essential_1001 essential_1001 0.0561659193 0.00501 -1.032688
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SummarizedExperiment_1.37.0 Biobase_2.67.0
## [3] GenomicRanges_1.59.1 GenomeInfoDb_1.43.1
## [5] IRanges_2.41.1 S4Vectors_0.45.2
## [7] BiocGenerics_0.53.3 generics_0.1.3
## [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
## [11] gscreend_1.21.0 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 SparseArray_1.7.2 fGarch_4033.92
## [4] lattice_0.22-6 digest_0.6.37 evaluate_1.0.1
## [7] grid_4.4.2 fastmap_1.2.0 jsonlite_1.8.9
## [10] Matrix_1.7-1 BiocManager_1.30.25 httr_1.4.7
## [13] gbutils_0.5 UCSC.utils_1.3.0 codetools_0.2-20
## [16] jquerylib_0.1.4 Rdpack_2.6.2 abind_1.4-8
## [19] cli_3.6.3 timeSeries_4041.111 rlang_1.1.4
## [22] crayon_1.5.3 rbibutils_2.3 XVector_0.47.0
## [25] cachem_1.1.0 DelayedArray_0.33.2 yaml_2.3.10
## [28] cvar_0.5 S4Arrays_1.7.1 parallel_4.4.2
## [31] tools_4.4.2 BiocParallel_1.41.0 nloptr_2.1.1
## [34] GenomeInfoDbData_1.2.13 buildtools_1.0.0 R6_2.5.1
## [37] lifecycle_1.0.4 zlibbioc_1.52.0 bslib_0.8.0
## [40] xfun_0.49 sys_3.4.3 knitr_1.49
## [43] spatial_7.3-17 htmltools_0.5.8.1 fBasics_4041.97
## [46] rmarkdown_2.29 maketools_1.3.1 timeDate_4041.110
## [49] compiler_4.4.2