Note: if you use MAGeCKFlute in published research, please cite: Binbin Wang, Mei Wang, Wubing Zhang. “Integrative analysis of pooled CRISPR genetic screens using MAGeCKFlute.” Nature Protocols (2019), doi: 10.1038/s41596-018-0113-7.
if(!"MAGeCKFlute" %in% installed.packages()) BiocManager::install("MAGeCKFlute")
if(!"clusterProfiler" %in% installed.packages()) BiocManager::install("clusterProfiler")
if(!"ggplot2" %in% installed.packages()) BiocManager::install("ggplot2")
library(MAGeCKFlute)
library(clusterProfiler)
library(ggplot2)
The MAGeCK (mageck test
) uses Robust Rank Aggregation
(RRA) for robust identification of CRISPR-screen hits, and outputs the
summary results at both sgRNA and gene level. Before performing the
downstream analysis, please make sure you have got the gene summary and
sgRNA summary results from mageck test
. MAGeCKFlute
incorporates an example datasets (Deng Pan
2018), which study the gene functions in T cell mediate tumor
killing by performing CRISPR screens in B16F10 cell lines co-cultured
with Pmel-1 T cells. There are three samples in the data, including
Pmel1_Input (B16F10 cells without T cell co-culture), Pmel1_Ctrl (B16F10
cells co-cultured with control T cells), and Pmel1 (B16F10 cells
co-cultured with antigen specific T cells). We compared
Pmel1
with Pmel1_Ctrl
using
mageck test
, which identifies genes associated with T cell
mediated tumor killing.
MAGeCKFlute processes MAGeCK RRA results (“gene_summary” and “sgrna_summary”) with the function FluteRRA, which identifies positively and negatively selected genes and performs functional analysis.
## path to the gene summary file (required)
file1 = file.path(system.file("extdata", package = "MAGeCKFlute"),
"testdata/rra.gene_summary.txt")
## path to the sgRNA summary file (optional)
file2 = file.path(system.file("extdata", package = "MAGeCKFlute"),
"testdata/rra.sgrna_summary.txt")
# Run FluteRRA with only gene summary file
FluteRRA(file1, proj="Pmel1", organism="mmu", outdir = "./")
# Run FluteRRA with both gene summary file and sgRNA summary file
FluteRRA(file1, file2, proj="Pmel1", organism="mmu", outdir = "./")
incorporateDepmap
will allow users to compare the
selected genes in the input data and Depmap data, which could clean the
selected genes by eliminating lethal genes (false positive hits in some
CRISPR screens) and then perform downstream enrichment analysis. You can
set the incorporateDepmap
to be TRUE
if you
want to include this analysis. Users could specify Depmap data of
specific cell lines by setting the cell_lines
and
lineages
.omitEssential
allows users to eliminate all common
essential genes selected in the Depmap from all of the downstream
analysis.Hints: all figures and intermediate data are saved into local directory “./MAGeCKFlute_Test/”, and all figures are integrated into file “FluteRRA_Test.pdf”.
For more available parameters in FluteRRA
, please read
the documentation using the command ?FluteRRA
.
The MAGeCK-VISPR (mageck mle
) utilizes a maximum
likelihood estimation (MLE) for robust identification of CRISPR screen
hits. It outputs beta scores and the associated
statistics in multiple conditions. The beta score
describes how a gene is selected: a positive beta score indicates
positive selection, and a negative beta score indicates negative
selection. Before using FluteMLE
, you should first get gene
summary result from MAGeCK-VISPR (mageck mle
). MAGeCKFlute
uses the same dataset as before for demonstration. Using
mageck mle
, we removed the baseline effect (plasmid sample)
from all the three samples, including Pmel1_Input (B16F10 cells without
T cell co-culture), Pmel1_Ctrl (B16F10 cells co-cultured with control T
cells), and Pmel1 (B16F10 cells co-cultured with antigen specific T
cells).
MAGeCKFlute processes MAGeCK MLE results (“gene_summary”) with the function FluteMLE, which performs QC and data normalization based on the beta score, identifies essential genes by comparing control and treatment samples, and finally performs functional enrichment analysis.
incorporateDepmap
will allow users to compare the
selected genes in the input data and Depmap data, which could clean the
selected genes by eliminating lethal genes (false positive hits in some
CRISPR screens) and then perform downstream enrichment analysis. You can
set the incorporateDepmap
to be TRUE
if you
want to include this analysis. Users could specify Depmap data of
specific cell lines by setting the cell_lines
and
lineages
.## Take Depmap screen as control
FluteMLE(gdata, treatname="Pmel1_Ctrl", ctrlname="Depmap", proj="Pmel1", organism="mmu", incorporateDepmap = TRUE)
omitEssential
allows users to eliminate all common
essential genes selected in the Depmap from all of the downstream
analysis.FluteMLE(gdata, treatname="Pmel1", ctrlname="Pmel1_Ctrl", proj="Pmel1", organism="mmu", omitEssential = TRUE)
Hint: All pipeline results are written into local directory “./MAGeCKFlute_Test/”, and all figures are integrated into file “FluteMLE_Test.pdf”.
For more available parameters in FluteMLE
, please read
the documentation using the command ?FluteMLE
.
MAGeCK/MAGeCK-VISPR outputs a count summary file, which summarizes some basic QC scores at raw count level, including map ratio, Gini index, and NegSelQC. MAGeCKFlute incorporates an example datasets (Ophir Shalem* 2014) for demonstration.
file4 = file.path(system.file("extdata", package = "MAGeCKFlute"),
"testdata/countsummary.txt")
countsummary = read.delim(file4, check.names = FALSE)
head(countsummary)
## File Label Reads Mapped Percentage
## 1 ../data/GSC_0131_Day23_Rep1.fastq.gz day23_r1 62818064 39992777 0.6366
## 2 ../data/GSC_0131_Day0_Rep2.fastq.gz day0_r2 47289074 31709075 0.6705
## 3 ../data/GSC_0131_Day0_Rep1.fastq.gz day0_r1 51190401 34729858 0.6784
## 4 ../data/GSC_0131_Day23_Rep2.fastq.gz day23_r2 58686580 37836392 0.6447
## TotalsgRNAs Zerocounts GiniIndex NegSelQC NegSelQCPval
## 1 64076 57 0.08510 0 1
## 2 64076 17 0.07496 0 1
## 3 64076 14 0.07335 0 1
## 4 64076 51 0.08587 0 1
## NegSelQCPvalPermutation NegSelQCPvalPermutationFDR NegSelQCGene
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
# Gini index
BarView(countsummary, x = "Label", y = "GiniIndex",
ylab = "Gini index", main = "Evenness of sgRNA reads")
# Missed sgRNAs
countsummary$Missed = log10(countsummary$Zerocounts)
BarView(countsummary, x = "Label", y = "Missed", fill = "#394E80",
ylab = "Log10 missed gRNAs", main = "Missed sgRNAs")
# Or
countsummary$Unmapped = countsummary$Reads - countsummary$Mapped
gg = reshape2::melt(countsummary[, c("Label", "Mapped", "Unmapped")], id.vars = "Label")
gg$variable = factor(gg$variable, levels = c("Unmapped", "Mapped"))
gg = gg[order(gg$Label, gg$variable), ]
p = BarView(gg, x = "Label", y = "value", fill = "variable",
position = "stack", xlab = NULL, ylab = "Reads", main = "Map ratio")
p + scale_fill_manual(values = c("#9BC7E9", "#1C6DAB"))
For CRISPR/Cas9 screens with two experimental conditions, MAGeCK-RRA is available for identification of essential genes. In MAGeCK-RRA results, the sgRNA summary and gene summary file summarizes the statistical significance of positive selections and negative selections at sgRNA level and gene level.
file1 = file.path(system.file("extdata", package = "MAGeCKFlute"),
"testdata/rra.gene_summary.txt")
gdata = ReadRRA(file1)
head(gdata)
## id Score FDR
## 1 Cd274 -3.4698 0.002475
## 2 Psmb8 -3.7260 0.002475
## 3 Rela -3.2413 0.008251
## 4 Otulin -3.6018 0.008663
## 5 Ikbkb -3.4256 0.010891
## 6 Tcea1 -3.4236 0.042079
file2 = file.path(system.file("extdata", package = "MAGeCKFlute"),
"testdata/rra.sgrna_summary.txt")
sdata = ReadsgRRA(file2)
head(sdata)
## sgrna Gene LFC FDR
## 1 AAACATATAGTGTACCTCTA Jak1 10.8910 0
## 2 TCCGAACCGAATCATCACTG Jak1 10.9170 0
## 3 TGAATAAATCCATCAGACAG Jak1 10.5970 0
## 4 GGATAGACGCCCAGCCACTG Stat1 9.9921 0
## 5 TTAATGACGAGCTCGTGGAG Stat1 9.2728 0
## 6 GAAAAGCAAGCGTAATCTCC Stat1 8.7931 0
gdata$HumanGene = TransGeneID(gdata$id, fromType = "symbol", toType = "symbol",
fromOrg = "mmu", toOrg = "hsa")
sdata$HumanGene = TransGeneID(sdata$Gene, fromType = "symbol", toType = "symbol",
fromOrg = "mmu", toOrg = "hsa")
Rank all the genes based on their scores and label genes in the rank plot.
gdata$Rank = rank(gdata$Score)
p1 = ScatterView(gdata, x = "Rank", y = "Score", label = "id",
top = 5, auto_cut_y = TRUE, ylab = "Log2FC",
groups = c("top", "bottom"))
print(p1)
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Label interested hits using parameter toplabels
(in
ScatterView) and genelist
(in RankView).
ScatterView(gdata, x = "Rank", y = "Score", label = "id", top = 5,
auto_cut_y = TRUE, groups = c("top", "bottom"),
ylab = "Log2FC", toplabels = c("Pbrm1", "Arid2", "Brd7"))
## Warning: ggrepel: 8 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
You can also use the function RankView
to draw the
figure.
geneList= gdata$Score
names(geneList) = gdata$id
p2 = RankView(geneList, top = 5, bottom = 10)
print(p2)
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
Visualize negative and positive selected genes separately.
For more information about functional enrichment analysis in MAGeCKFlute, please read the MAGeCKFlute_enrichment document, in which we introduce all the available options and methods.
The MAGeCK-VISPR (mageck mle
) computes beta scores and
the associated statistics for all genes in multiple conditions. The
beta score describes how the gene is selected: a
positive beta score indicates a positive selection, and a negative beta
score indicates a negative selection. Before using
FluteMLE
, you should first get gene summary result from
MAGeCK-VISPR (mageck mle
). MAGeCKFlute incorporates an
example datasets for demonstration.
Is there batch effects? This is a commonly asked question before
perform later analysis. In our package, we provide
HeatmapView
to ensure whether the batch effect exists in
data and use BatchRemove
to remove easily if same batch
samples cluster together.
##Before batch removal
edata = matrix(c(rnorm(2000, 5), rnorm(2000, 8)), 1000)
colnames(edata) = paste0("s", 1:4)
HeatmapView(cor(edata))
## After batch removal
batchMat = data.frame(sample = colnames(edata), batch = rep(1:2, each = 2))
edata1 = BatchRemove(edata, batchMat)
head(edata1$data)
## s1 s2 s3 s4
## [1,] 7.082637 5.819096 6.872381 5.732671
## [2,] 5.411736 6.958999 6.212868 5.979145
## [3,] 6.356394 7.121785 6.242167 7.024653
## [4,] 6.767546 7.153993 7.137302 6.922231
## [5,] 5.136134 7.681344 6.427540 5.876513
## [6,] 6.947032 6.196381 6.419730 6.761133
file3 = file.path(system.file("extdata", package = "MAGeCKFlute"),
"testdata/mle.gene_summary.txt")
# Read and visualize the file format
gdata = ReadBeta(file3)
head(gdata)
## Gene Pmel1_Input Pmel1_Ctrl Pmel1
## 1 Defb34 0.087884 -0.010034 -0.068015
## 2 Mndal 0.282860 0.033850 -0.161790
## 3 Cox8c 0.212140 -0.031618 -0.491360
## 4 Poldip3 0.125260 -0.123200 -0.450270
## 5 Bcas2 -0.196670 0.031254 -0.457530
## 6 Klk1b9 -0.220270 -0.019697 -0.443300
It is difficult to control all samples with a consistent cell cycle in a CRISPR screen experiment with multi conditions. Besides, beta score among different states with an inconsistent cell cycle is incomparable. So it is necessary to do the normalization when comparing the beta scores in different conditions. Essential genes are those genes that are indispensable for its survival. The effect generated by knocking out these genes in different cell types is consistent. Based on this, we developed the cell cycle normalization method to shorten the gap of the cell cycle in different conditions.
ctrlname = "Pmel1_Ctrl"
treatname = "Pmel1"
gdata$HumanGene = TransGeneID(gdata$Gene, fromType = "symbol", toType = "symbol",
fromOrg = "mmu", toOrg = "hsa")
gdata_cc = NormalizeBeta(gdata, id = "HumanGene", samples=c(ctrlname, treatname),
method="cell_cycle")
head(gdata_cc)
## Gene Pmel1_Input Pmel1_Ctrl Pmel1 HumanGene
## 1 Defb34 0.087884 -0.06263323 -0.2425685 DEFB106A
## 2 Mndal 0.282860 0.21129508 -0.5770074 IFI16
## 3 Cox8c 0.212140 -0.19736271 -1.7523850 COX8C
## 4 Poldip3 0.125260 -0.76902670 -1.6058418 POLDIP3
## 5 Bcas2 -0.196670 0.19509059 -1.6317338 BCAS2
## 6 Klk1b9 -0.220270 -0.12295064 -1.5809840 KLK3
After normalization, the distribution of beta scores in different conditions should be similar. We can evaluate the distribution of beta scores using the function ‘DensityView’, and ‘ConsistencyView’.
gdata_cc$Control = rowMeans(gdata_cc[,ctrlname, drop = FALSE])
gdata_cc$Treatment = rowMeans(gdata_cc[,treatname, drop = FALSE])
p1 = ScatterView(gdata_cc, "Control", "Treatment", label = "Gene",
auto_cut_diag = TRUE, display_cut = TRUE,
groups = c("top", "bottom"),
toplabels = c("Pbrm1", "Brd7", "Arid2", "Jak1", "Stat1", "B2m"))
print(p1)
## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
gdata_cc$Diff = gdata_cc$Treatment - gdata_cc$Control
gdata_cc$Rank = rank(gdata_cc$Diff)
p1 = ScatterView(gdata_cc, x = "Rank", y = "Diff", label = "Gene",
top = 5, auto_cut_y = TRUE, groups = c("top", "bottom"))
print(p1)
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# Or
rankdata = gdata_cc$Treatment - gdata_cc$Control
names(rankdata) = gdata_cc$Gene
RankView(rankdata)
## Warning: No shared levels found between `names(values)` of the manual scale and the
## data's fill values.
p1 = ScatterView(gdata_cc, x="Pmel1_Ctrl", y="Pmel1", label = "Gene",
model = "ninesquare", top = 5, display_cut = TRUE, force = 2)
print(p1)
## Warning: ggrepel: 10 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Customize the cutoff
p1 = ScatterView(gdata_cc, x="Pmel1_Ctrl", y="Pmel1", label = "Gene",
model = "ninesquare", top = 5, display_cut = TRUE,
x_cut = c(-1,1), y_cut = c(-1,1))
print(p1)
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Or
# 9-square groups
Square9 = p1$data
idx=Square9$group=="topcenter"
geneList = Square9$Diff[idx]
names(geneList) = Square9$Gene[idx]
universe = Square9$Gene
# Enrichment analysis
kegg1 = EnrichAnalyzer(geneList = geneList, universe = universe)
EnrichedView(kegg1, top = 6, bottom = 0)
Also, pathway visualization can be done using function
KeggPathwayView
(Luo et al.
2013).
gdata_cc$Entrez = TransGeneID(gdata_cc$Gene, "symbol", "entrez", organism = "mmu")
genedata = gdata_cc[, c("Entrez", "Control","Treatment")]
arrangePathview(genedata, pathways = "mmu04630", organism = "mmu", sub = NULL)
## Warning in data(bods): data set 'bods' not found
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.5.1 clusterProfiler_4.13.3 MAGeCKFlute_2.9.0
## [4] BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 sys_3.4.2 jsonlite_1.8.8
## [4] magrittr_2.0.3 farver_2.1.2 rmarkdown_2.28
## [7] fs_1.6.4 zlibbioc_1.51.1 vctrs_0.6.5
## [10] memoise_2.0.1 RCurl_1.98-1.16 ggtree_3.13.1
## [13] htmltools_0.5.8.1 AnnotationHub_3.13.3 curl_5.2.2
## [16] gridGraphics_0.5-1 depmap_1.19.1 sass_0.4.9
## [19] bslib_0.8.0 plyr_1.8.9 httr2_1.0.4
## [22] cachem_1.1.0 buildtools_1.0.0 igraph_2.0.3
## [25] lifecycle_1.0.4 pkgconfig_2.0.3 gson_0.1.0
## [28] Matrix_1.7-0 R6_2.5.1 fastmap_1.2.0
## [31] MatrixGenerics_1.17.0 GenomeInfoDbData_1.2.12 digest_0.6.37
## [34] aplot_0.2.3 enrichplot_1.25.2 colorspace_2.1-1
## [37] patchwork_1.2.0 AnnotationDbi_1.67.0 S4Vectors_0.43.2
## [40] pathview_1.45.0 ExperimentHub_2.13.1 RSQLite_2.3.7
## [43] org.Hs.eg.db_3.19.1 labeling_0.4.3 filelock_1.0.3
## [46] fansi_1.0.6 mgcv_1.9-1 httr_1.4.7
## [49] polyclip_1.10-7 compiler_4.4.1 bit64_4.0.5
## [52] withr_3.0.1 BiocParallel_1.39.0 viridis_0.6.5
## [55] DBI_1.2.3 highr_0.11 ggforce_0.4.2
## [58] R.utils_2.12.3 MASS_7.3-61 rappdirs_0.3.3
## [61] tools_4.4.1 scatterpie_0.2.4 ape_5.8
## [64] msigdbr_7.5.1 R.oo_1.26.0 glue_1.7.0
## [67] nlme_3.1-166 GOSemSim_2.31.2 shadowtext_0.1.4
## [70] grid_4.4.1 reshape2_1.4.4 sva_3.53.0
## [73] fgsea_1.31.0 generics_0.1.3 gtable_0.3.5
## [76] R.methodsS3_1.8.2 tidyr_1.3.1 data.table_1.16.0
## [79] tidygraph_1.3.1 utf8_1.2.4 XVector_0.45.0
## [82] BiocGenerics_0.51.1 ggrepel_0.9.6 BiocVersion_3.20.0
## [85] pillar_1.9.0 stringr_1.5.1 babelgene_22.9
## [88] limma_3.61.9 yulab.utils_0.1.7 genefilter_1.87.0
## [91] splines_4.4.1 dplyr_1.1.4 tweenr_2.0.3
## [94] treeio_1.29.1 BiocFileCache_2.13.0 lattice_0.22-6
## [97] survival_3.7-0 bit_4.0.5 annotate_1.83.0
## [100] tidyselect_1.2.1 locfit_1.5-9.10 GO.db_3.19.1
## [103] maketools_1.3.0 Biostrings_2.73.1 knitr_1.48
## [106] gridExtra_2.3 IRanges_2.39.2 edgeR_4.3.14
## [109] stats4_4.4.1 xfun_0.47 graphlayouts_1.1.1
## [112] Biobase_2.65.1 statmod_1.5.0 matrixStats_1.4.1
## [115] pheatmap_1.0.12 KEGGgraph_1.65.0 stringi_1.8.4
## [118] UCSC.utils_1.1.0 lazyeval_0.2.2 ggfun_0.1.6
## [121] yaml_2.3.10 evaluate_0.24.0 codetools_0.2-20
## [124] ggraph_2.2.1 tibble_3.2.1 qvalue_2.37.0
## [127] Rgraphviz_2.49.0 BiocManager_1.30.25 graph_1.83.0
## [130] ggplotify_0.1.2 cli_3.6.3 xtable_1.8-4
## [133] munsell_0.5.1 jquerylib_0.1.4 Rcpp_1.0.13
## [136] GenomeInfoDb_1.41.1 dbplyr_2.5.0 png_0.1-8
## [139] XML_3.99-0.17 parallel_4.4.1 blob_1.2.4
## [142] DOSE_3.99.1 bitops_1.0-8 viridisLite_0.4.2
## [145] tidytree_0.4.6 scales_1.3.0 purrr_1.0.2
## [148] crayon_1.5.3 rlang_1.1.4 cowplot_1.1.3
## [151] fastmatch_1.1-4 KEGGREST_1.45.1