Vignette illustrating the usage of gscreend on simulated data

Introduction

Pooled CRISPR perturbations screens employ a library of guide RNAs (gRNAs) that is transduced into a pool of cells with the aim to induce a single genetic perturbation in each cell. The perturbation effect is assessed by measuring the abundance of each gRNA after the screen selection phase and comparing it to its abundance in the plasmid library. The main goal of the following analysis is the detection of essential genes, i.e. genes whose knockout reduces the cell fitness. The package gscreend provides a method to rank genes based on count tables.

gscreend workflow

In order to identify essential genes starting from raw gRNA count data, gscreend performs the following analysis steps:

  1. Input of raw gRNA counts at T0 (sequencing of library) and T1 (at the end of the screen). Normalization and calculation of log fold changes.

  2. Split log fold changes into intervals dependent on the initial count at T0.

  3. For every interval fit a skew-normal distribution to the data to model the null hypothesis (via least quantile regression).

  4. Based on the null model calculate p-values for every gRNA.

  5. Rank gRNAs according to p-value and perform robust ranking aggregation to calculate p-values on gene level.

  6. Perform quality control of data and statistical model.

Installation

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("gscreend")
library(gscreend)
library(SummarizedExperiment)

Analysis of simulated data with gscreend

Input data: gRNA counts

The simulated data used in this example has been generated using the simulation method available at https://github.com/imkeller/simulate_pooled_screen

Raw count data consists of gRNA counts in the library sequencing and different replicates after the screen proliferation phase. In order to estimate the effect of a specific gRNA on cell fitness, the relative abundances of the gRNA before and after the proliferation phase will be compares

raw_counts <- read.table(
    system.file("extdata", "simulated_counts.txt", package = "gscreend"),
    header=TRUE)

Generate a summarized experiment from the count data. gscreend currently uses SummarizedExperiment objects as an input format.

The count matrix contains raw gRNA counts. Each row represents one gRNA, each column represents one sample (T0, T1, replicates, …).

counts_matrix <- cbind(raw_counts$library0, 
                        raw_counts$R0_0, 
                        raw_counts$R1_0)

rowData <- data.frame(sgRNA_id = raw_counts$sgrna_id,
                    gene = raw_counts$Gene)

colData <- data.frame(samplename = c("library", "R1", "R2"),
                    # timepoint naming convention: 
                    # T0 -> reference, 
                    # T1 -> after proliferation
                    timepoint = c("T0", "T1", "T1"))

se <- SummarizedExperiment(assays=list(counts=counts_matrix),
                    rowData=rowData, colData=colData)

Run gscreend

In this step a gscreend experiment object is generated that will after the analysis contain all data related to gRNAs, genes and model parameters.

pse <- createPoolScreenExp(se)
## Creating PoolScreenExp object from a SummarizedExperiment object.
## References and samples are named correctly.
## Data concerning sgRNA and genes is provided.

Run gscreend with default parameters.

pse_an <- RunGscreend(pse)
## Size normalized count data.
## Calculated LFC.
## Fitted null distribution.
## Calculated p-values at gRNA level.
## Ranking genes...
## ... for positive fold changes
## ... for negative fold changes
## gscreend analysis has been completed.

Quality control

gscreend provides basic quality control functions for inspection of replicate correlation for example.

plotReplicateCorrelation(pse_an)

The plotModelParameters() function can be used to inspect the values of the parameters estimated for the skew normal distribution of the logarithmic fold change data.

plotModelParameters(pse_an) 

Results

The ResultsTable function can be used to extract a table listing for each gene the p-value and fdr. These values correspond to the results from the statistical test indicting whether upon perturbation a specific gene reduces (direction = “negative”), or increasing (direction = “positive”) cell viability.

res <- ResultsTable(pse_an, direction = "negative")
head(res)
##                          Name          fdr    pval       lfc
## essential_0       essential_0 0.0006150062 0.00004 -1.421372
## essential_1       essential_1 0.0000000000 0.00000 -1.984253
## essential_10     essential_10 0.0004752852 0.00003 -1.605037
## essential_100   essential_100 0.0000000000 0.00000 -3.075151
## essential_1000 essential_1000 0.0000000000 0.00000 -2.023614
## essential_1001 essential_1001 0.0552690583 0.00493 -1.032688

Session Info

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.37.0 Biobase_2.67.0             
##  [3] GenomicRanges_1.59.1        GenomeInfoDb_1.43.2        
##  [5] IRanges_2.41.2              S4Vectors_0.45.2           
##  [7] BiocGenerics_0.53.3         generics_0.1.3             
##  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
## [11] gscreend_1.21.0             BiocStyle_2.35.0           
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9              SparseArray_1.7.2       fGarch_4033.92         
##  [4] lattice_0.22-6          digest_0.6.37           evaluate_1.0.1         
##  [7] grid_4.4.2              fastmap_1.2.0           jsonlite_1.8.9         
## [10] Matrix_1.7-1            BiocManager_1.30.25     httr_1.4.7             
## [13] gbutils_0.5             UCSC.utils_1.3.0        codetools_0.2-20       
## [16] jquerylib_0.1.4         abind_1.4-8             Rdpack_2.6.2           
## [19] cli_3.6.3               timeSeries_4041.111     rlang_1.1.4            
## [22] crayon_1.5.3            rbibutils_2.3           XVector_0.47.1         
## [25] cachem_1.1.0            DelayedArray_0.33.3     yaml_2.3.10            
## [28] S4Arrays_1.7.1          cvar_0.5                parallel_4.4.2         
## [31] tools_4.4.2             BiocParallel_1.41.0     nloptr_2.1.1           
## [34] GenomeInfoDbData_1.2.13 buildtools_1.0.0        R6_2.5.1               
## [37] lifecycle_1.0.4         bslib_0.8.0             xfun_0.49              
## [40] sys_3.4.3               knitr_1.49              spatial_7.3-17         
## [43] htmltools_0.5.8.1       fBasics_4041.97         rmarkdown_2.29         
## [46] maketools_1.3.1         timeDate_4041.110       compiler_4.4.2