Welcome to the
amplican
package. This vignette will walk you through our
main package usage with example MiSeq dataset. You will learn how to
interpret results, create summary reports and plot deletions, insertions
and mutations with our functions. This package, amplican
,
is created for fast and precise analysis of CRISPR experiments.
amplican
creates reports of deletions, insertions,
frameshifts, cut rates and other metrics in knitable to HTML
style. amplican
uses many Bioconductor
and
CRAN
packages, and is high level package with purpose to
align your fastq samples and automate analysis across different
experiments. amplican
maintains elasticity through
configuration file, which, with your fastq samples are the only
requirements.
For those inpatient of you, who want to see an example of our whole pipeline analysis on attached example data look here. Below you will find the conceptual map of amplican.
Below you will find the amplicanConsensus
rules. That
allow you to understand how ampliCan treats unambiguous forward and
reverse reads. Green color indicates events that will be accepted. When
forward and reverse reads agree, their events are in the same place and
span the same length, we will take forward read event as representative.
In case when events from forward and reverse read don’t agree we select
event from strand with higher alignment score. In situation where one of
the reads is not spanning event in question we consider this event as
real (as we don’t have other information). If both strands cover event
in question, but one strand has no indel, amplicanConsensus
will change behavior according to the strict
parameter.
To successfully run our analysis it is mandatory to have a configuration file. Take a look at our example:
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:generics':
##
## intersect, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff,
## setequal, table, tapply, union, unique, unsplit, which.max,
## which.min
## Loading required package: Biostrings
## Loading required package: S4Vectors
## Loading required package: stats4
## Warning: multiple methods tables found for 'setequal'
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'IRanges'
## Loading required package: XVector
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'XVector'
## Loading required package: GenomeInfoDb
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'GenomeInfoDb'
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'Biostrings'
## Warning: multiple methods tables found for 'setequal'
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: pwalign
## Warning: replacing previous import 'BiocGenerics::setequal' by
## 'S4Vectors::setequal' when loading 'pwalign'
##
## Attaching package: 'pwalign'
## The following objects are masked from 'package:Biostrings':
##
## PairwiseAlignments, PairwiseAlignmentsSingleSubject, aligned,
## alignedPattern, alignedSubject, compareStrings, deletion,
## errorSubstitutionMatrices, indel, insertion, mismatchSummary,
## mismatchTable, nedit, nindel, nucleotideSubstitutionMatrix,
## pairwiseAlignment, pid, qualitySubstitutionMatrices, stringDist,
## unaligned, writePairwiseAlignments
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following object is masked from 'package:IRanges':
##
## shift
## The following objects are masked from 'package:S4Vectors':
##
## first, second
## version: 1.29.0
## Please consider supporting this software by citing:
##
## Labun et al. 2019
## Accurate analysis of genuine CRISPR editing events with ampliCan.
## Genome Res. 2019 Mar 8
## doi: 10.1101/gr.244293.118
ID | Barcode | Forward_Reads | Reverse_Reads | Group | Control | guideRNA | Forward_Primer | Reverse_Primer | Direction | Amplicon | Donor |
---|---|---|---|---|---|---|---|---|---|---|---|
ID_1 | barcode_1 | R1_001.fastq | R2_001.fastq | Betty | 0 | AGGTGGTCAGGGAACTGG | AAGCTGACGGCTAAATGA | AATTACACAAGCGCAAACACAC | 0 | aagctgacggctaaatgaaaaatgtcaaacatctgttccaggtgctgcgtatgccagggcagaggAGGTGGTCAGGGAACTGGtggaggtcactgggataccctttcttcccacaccaatggggaaaggagtcctgccagatgaccatcccaactgtgttgctgcagccagatccaggtgtgtttgcgcttgtgtaatt | |
ID_2 | barcode_1 | R1_001.fastq | R2_001.fastq | Tom | 0 | TGACCCTCTGCCAACACAAGGGG | TGACCAAACCTTCTTAAGGTGC | CTCTGCTGCAAAATGCAAGG | 1 | aaatactgtcttgtgaccaaaccttcttaaggtgctattttgataataaactttattgtgcttttgtagttgtgCCCCTTGTGTTGGCAGAGGGTCAgcagaccagtaagtcttctcaatttcttttatttatgtgtagtgataaaaaaatgttaaattaaaattaaatgtttttttttgccttgcattttgcagcagaggatgat | |
ID_3 | barcode_2 | R1_002.fastq | R2_002.fastq | Tom | 0 | AGGTGGTCAGGGAACTGG | AAGCTGACGGCTAAATGA | AATTACACAAGCGCAAACACAC | 0 | aagctgacggctaaatgaaaaatgtcaaacatctgttccaggtgctgcgtatgccagggcagaggAGGTGGTCAGGGAACTGGtggaggtcactgggataccctttcttcccacaccaatggggaaaggagtcctgccagatgaccatcccaactgtgttgctgcagccagatccaggtgtgtttgcgcttgtgtaatt | |
ID_4 | barcode_2 | R1_002.fastq | R2_002.fastq | Betty | 0 | GTCCCTGCAACATTAAAGGCCGG | GCTGGCAACATTCCTACCAGT | GAGCGCTGAGGCAGGATTAT | 0 | gctggcaacattcctaccagtaatttacgtaaaaaaatgctataaaatgtgtagctctccagtctaatgtaacttgtgcttgcattgtgtttacaggaaaccaGTCCCTGCAACATTAAAGGCCGGaagtctaagaactcacatcagcaggtgtcaagtgtgcatgaagagggtataatcctgcctcagcgctc | |
ID_5 | barcode_2 | R1_002.fastq | R2_002.fastq | Betty | 0 | GTCCCTGCAACATTAAAGGCCGG | ACTGGCAACATTCCTACCAGT | ACTGGCTGAGGCAGGATTAT | 0 | actggcaacattcctaccagtaatttacgtaaaaaaatgctataaaatgtgtagctctccagtctaatgtaacttgtgcttgcattgtgtttacaggaaaccaGTCCCTGCAACATTAAAGGCCGGaagtctaagaactcacatcagcaggtgtcaagtgtgcatgaagagggtataatcctgcctcagccagt | actggcaacattcctaccagtaatttacgtaaaaaaatgctataaaatgtgtagctctccagtctaatgtaacttgtgcttgcattgtgtttacaggaaaccaGTCCCTGCAACATTAAAAGCCGGaagtctaagaactcacatcagcaggtgtcaagtgtgcatgaagagggtataatcctgcctcagccagt |
Configuration file should be a “,” delimited csv file with information about your experiments. You can find example config file path by running:
Columns of the config file:
amplican
will estimate HDR efficiency based on
the events from aligning donor and amplicon sequences, donor and reads
and reads and amplicon.If you have only forward primers leave column Reverse_Primer empty, leave empty also the Reverse_Reads column. You can still use amplican like normal.
To run amplican
with default options, along with
generation of all posible reports you can use
amplicanPipeline
function. We have already attached results
of the default amplican analysis (look at other vignettes) of the
example dataset, but take a look below at how you can do that yourself.
Be prepared to grab a coffe when running amplicanPipeline
with knit_files = TRUE
as this will take some time. You
will understand it is worth waiting when reports will be ready.
# path to example config file
config <- system.file("extdata", "config.csv", package = "amplican")
# path to example fastq files
fastq_folder <- system.file("extdata", package = "amplican")
# output folder, a full path
results_folder <- tempdir()
# run amplican
amplicanPipeline(config, fastq_folder, results_folder)
# results of the analysis can be found at
message(results_folder)
Take a look into “results_folder” folder. Here you can find
.Rmd
files that are our reports for example dataset. We
already crafted .html
versions and you can find them in the
“reports” folder. Open one of the reports with your favourite browser
now. To zoom on an image just open it in new window (right click ->
open image in new tab).
amplicanPipeline
just crafted very detailed reports for
you, but this is not all, if you need something different e.g. different
plot colours, just change the .Rmd
file and
knit
it again. This way you have all the power over
plotting.
First step of our analysis is to filter out reads that are not complying with our default restrictions:
Barcode | experiment_count | read_count | bad_base_quality | bad_average_quality | bad_alphabet | filtered_read_count | unique_reads | unassigned_reads | assigned_reads |
---|---|---|---|---|---|---|---|---|---|
barcode_1 | 2 | 20 | 0 | 3 | 0 | 17 | 11 | 4 | 7 |
barcode_2 | 3 | 21 | 0 | 0 | 0 | 21 | 9 | 0 | 9 |
This table is also summarized in one of the reports. As you can see
for our example dataset we have two barcodes, to which correspond 21 and
20 reads. Six reads are rejected for barcode_1 due to bad alphabet and
bad average quality. Each of the barcodes has unique reads, which means
forward and reverse reads are compacted when they are identical. There
is 8 and 9 unique reads for each barcode. One read failed with
assignment for barcode_1, you can see this read in the top unassgned
reads for barcode report in human readable form. Normally you will
probably see only half of your reads being assigned to the barcodes.
Reads are assigned when for forward read we can find forward primer and
for reverse read we can find reverse primer. Primers have to be
perfectly matched. Nevertheless, there is option
fastqreads = 0.5
which changes method of assigning reads to
each IDs. With this option specified only one of the reads (forward or
reverse) have to have primer perfectly matched. If you don’t have the
reverse reads or you don’t want to use them you can use option
fastqreads = 1
, this option should be detectd autmatically,
if you leave empty field Reverse_Primer in the config
file.
config_summary.csv
contains extended version of the
config file. It should provide you additional look at raw numbers which
we use for various plots in reports. Take a look at example
extension:
ID | Barcode | Reads | Reads_Filtered | Reads_In | Reads_Del | Reads_Edited | Reads_Frameshifted |
---|---|---|---|---|---|---|---|
ID_1 | barcode_1 | 7 | 6 | 0 | 6 | 6 | 2 |
ID_2 | barcode_1 | 6 | 6 | 0 | 6 | 6 | 4 |
ID_3 | barcode_2 | 9 | 8 | 1 | 7 | 8 | 1 |
ID_4 | barcode_2 | 7 | 7 | 7 | 4 | 7 | 2 |
ID_5 | barcode_2 | 5 | 5 | 0 | 0 | 2 | 0 |
During amplicanPipeline
these columns are added to the
config file:
amplicanConsensus
File RunParameters.txt lists all options used for the analysis, this file you might find useful when reviewing analysis from the past where you forgot what kind of options you used. Other than that this file has no purpose.
# path to example RunParameters.txt
run_params <- system.file("extdata", "results", "RunParameters.txt",
package = "amplican")
# show contents of the file
readLines(run_params)
## [1] "Config file: full/path/to/config/file/that/has/been/used.csv"
## [2] "Average Quality: 30"
## [3] "Minimum Quality: 0"
## [4] "Filter N-reads: FALSE"
## [5] "Batch size: 1e+07"
## [6] "Write Alignments: None"
## [7] "Fastq files Mode: 0"
## [8] "Gap Opening: 25"
## [9] "Gap Extension: 0"
## [10] "Consensus: TRUE"
## [11] "Normalize: guideRNA, Group"
## [12] "PRIMER DIMER buffer: 30"
## [13] "Cut buffer: 5"
## [14] "Scoring Matrix:"
## [15] ",A,C,G,T,M,R,W,S,Y,K,V,H,D,B,N"
## [16] "A,5,-4,-4,-4,0.5,0.5,0.5,-4,-4,-4,-1,-1,-1,-4,-1.75"
## [17] "C,-4,5,-4,-4,0.5,-4,-4,0.5,0.5,-4,-1,-1,-4,-1,-1.75"
## [18] "G,-4,-4,5,-4,-4,0.5,-4,0.5,-4,0.5,-1,-4,-1,-1,-1.75"
## [19] "T,-4,-4,-4,5,-4,-4,0.5,-4,0.5,0.5,-4,-1,-1,-1,-1.75"
## [20] "M,0.5,0.5,-4,-4,0.5,-1.75,-1.75,-1.75,-1.75,-4,-1,-1,-2.5,-2.5,-1.75"
## [21] "R,0.5,-4,0.5,-4,-1.75,0.5,-1.75,-1.75,-4,-1.75,-1,-2.5,-1,-2.5,-1.75"
## [22] "W,0.5,-4,-4,0.5,-1.75,-1.75,0.5,-4,-1.75,-1.75,-2.5,-1,-1,-2.5,-1.75"
## [23] "S,-4,0.5,0.5,-4,-1.75,-1.75,-4,0.5,-1.75,-1.75,-1,-2.5,-2.5,-1,-1.75"
## [24] "Y,-4,0.5,-4,0.5,-1.75,-4,-1.75,-1.75,0.5,-1.75,-2.5,-1,-2.5,-1,-1.75"
## [25] "K,-4,-4,0.5,0.5,-4,-1.75,-1.75,-1.75,-1.75,0.5,-2.5,-2.5,-1,-1,-1.75"
## [26] "V,-1,-1,-1,-4,-1,-1,-2.5,-1,-2.5,-2.5,-1,-2,-2,-2,-1.75"
## [27] "H,-1,-1,-4,-1,-1,-2.5,-1,-2.5,-1,-2.5,-2,-1,-2,-2,-1.75"
## [28] "D,-1,-4,-1,-1,-2.5,-1,-1,-2.5,-2.5,-1,-2,-2,-1,-2,-1.75"
## [29] "B,-4,-1,-1,-1,-2.5,-2.5,-2.5,-1,-1,-1,-2,-2,-2,-1,-1.75"
## [30] "N,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75,-1.75"
As name indicates it contains all alignments.
# path to the example alignments folder
system.file("extdata", "results", "alignments", package = "amplican")
In unassigned_reads.csv
you can find detailed
information about unassigned reads. In example dataset there is one
unassigned read.
Take a look at the alignment events file which contains all the
insertions, deletions, cuts and mutations. This file can be used in
various ways. Examples you can find in .Rmd
files we
prepare using amplicanReport
. These can be easily converted
into GRanges
and used for further analysis! Events are
saved at three points of amplicanPipeline
analysis. First
file “raw_events.csv” contains all events directly extracted from
aligned reads. After filtering PRIMER DIMER reads, removing events
overlapping primers (alignment artifacts) and shifting events so that
they are relative to the expected cut sites
“events_filtered_shifted.csv” is saved. After normalization through
amplicanNormalize
“events_filtered_shifted_normalized.csv”
is saved, probably it is the file you should use for further
analysis.
seqnames | start | end | width | strand | originally | replacement | type | read_id | score | counts | readType | overlaps | consensus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ID_1 | 42 | 61 | 20 | + | AAAAAAAAAAAAAAAAAAAA | insertion | 1 | 597 | 3 | FALSE | FALSE | TRUE | |
ID_1 | 46 | 65 | 20 | + | AAAAAAAAAAAAAAAAAAAA | insertion | 2 | 557 | 2 | FALSE | FALSE | TRUE | |
ID_1 | -24 | 41 | 66 | + | deletion | 1 | 597 | 3 | FALSE | TRUE | TRUE | ||
ID_1 | -28 | 45 | 74 | + | deletion | 2 | 557 | 2 | FALSE | TRUE | TRUE | ||
ID_1 | -32 | 51 | 84 | + | deletion | 4 | 532 | 1 | FALSE | TRUE | TRUE | ||
ID_1 | -35 | -35 | 1 | + | A | G | mismatch | 1 | 597 | 3 | FALSE | FALSE | TRUE |
Human readable alignments can be accesed using
lookupAlignment
function of
AlignmentsExperimentSet
object which contains all
information after alignments from multiple experiments. Human readable
format looks like this:
aln <- system.file("extdata", "results", "alignments",
"AlignmentsExperimentSet.rds",
package = "amplican")
aln <- readRDS(aln)
amplican::lookupAlignment(aln, ID = "ID_1") # will print most frequent alignment for ID_1
## ########################################
## # Forward read for ID ID_1 read_id 1
## # Program: pwalign (version 1.3.0), a Bioconductor package
## # Rundate: Sat Nov 2 05:59:59 2024
## ########################################
## #=======================================
## #
## # Aligned sequences:
## # Forward read: P1
## # Amplicon sequence: S1
## # Matrix: NA
## # Gap_penalty: 25.0
## # Extend_penalty: 0.0
## #
## # Length: 219
## # Identity: 131/219 (59.8%)
## # Similarity: NA/219 (NA%)
## # Gaps: 86/219 (39.3%)
## # Score: 597
## #
## #
## #=======================================
##
## P1 1 AAGCTGACGGCTAAATGAAAAATGTCAAACGTCTGTTCCAG--------- 41
## |||||||||||||||||||||||||||||| ||||||||||
## S1 1 AAGCTGACGGCTAAATGAAAAATGTCAAACATCTGTTCCAGGTGCTGCGT 50
##
## P1 42 -------------------------------------------------- 41
##
## S1 51 ATGCCAGGGCAGAGGAGGTGGTCAGGGAACTGGTGGAGGTCACTGGGATA 100
##
## P1 42 -------AAAAAAAAAAAAAAAAAAAATTCCCACACCAATGGGGAAAGGA 84
## |||||||||||||||||||||||
## S1 101 CCCTTTC--------------------TTCCCACACCAATGGGGAAAGGA 130
##
## P1 85 GTCCTGCCAGATGACCATCCCAACTGTGTTGCAGCAGCCAGATCCAGGTG 134
## |||||||||||||||||||||||||||||||| |||||||||||||||||
## S1 131 GTCCTGCCAGATGACCATCCCAACTGTGTTGCTGCAGCCAGATCCAGGTG 180
##
## P1 135 TGTTTGCGCTTGTGTAATT 153
## |||||||||||||||||||
## S1 181 TGTTTGCGCTTGTGTAATT 199
##
##
## #---------------------------------------
## #---------------------------------------
## ########################################
## # Reverse read for ID ID_1 read_id 1
## # Program: pwalign (version 1.3.0), a Bioconductor package
## # Rundate: Sat Nov 2 05:59:59 2024
## ########################################
## #=======================================
## #
## # Aligned sequences:
## # Reverse read: P1
## # Amplicon sequence: S1
## # Matrix: NA
## # Gap_penalty: 25.0
## # Extend_penalty: 0.0
## #
## # Length: 219
## # Identity: 131/219 (59.8%)
## # Similarity: NA/219 (NA%)
## # Gaps: 86/219 (39.3%)
## # Score: 597
## #
## #
## #=======================================
##
## P1 1 AAGCTGACGGCTAAATGAAAAATGTCAAACGTCTGTTCCAG--------- 41
## |||||||||||||||||||||||||||||| ||||||||||
## S1 1 AAGCTGACGGCTAAATGAAAAATGTCAAACATCTGTTCCAGGTGCTGCGT 50
##
## P1 42 -------------------------------------------------- 41
##
## S1 51 ATGCCAGGGCAGAGGAGGTGGTCAGGGAACTGGTGGAGGTCACTGGGATA 100
##
## P1 42 -------AAAAAAAAAAAAAAAAAAAATTCCCACACCAATGGGGAAAGGA 84
## |||||||||||||||||||||||
## S1 101 CCCTTTC--------------------TTCCCACACCAATGGGGAAAGGA 130
##
## P1 85 GTCCTGCCAGATGACCATCCCAACTGTGTTGCAGCAGCCAGATCCAGGTG 134
## |||||||||||||||||||||||||||||||| |||||||||||||||||
## S1 131 GTCCTGCCAGATGACCATCCCAACTGTGTTGCTGCAGCCAGATCCAGGTG 180
##
## P1 135 TGTTTGCGCTTGTGTAATT 153
## |||||||||||||||||||
## S1 181 TGTTTGCGCTTGTGTAATT 199
##
##
## #---------------------------------------
## #---------------------------------------
Reports are automated for the convenience of the users. We provide 6
different reports. Reports are .Rmd files which can be easily crafted
through rmarkdown::render(path_to_report)
or by clicking
Knit
in Rstudio to make HTML version of the report. If you
have run our example analysis, then you can open one of the reports with
Rstudio and try knitting it now! Otherwise you can open one of already
knitted example report in the vignettes.
When you want to have more control over alignments and you need more
advanced options use amplicanAlign
. This function has many
parameters which can be used to bend our pipeline to your needs. Read
more about amplicanAlign
on the help page
?amplican::amplicanAlign
.
Read more about normalization in the description of the
?amplican::amplicanNormalize
or in FAQ. Just note that
default setting of normalization threshold is 1%, if you notice in your
mismatch plots that general level of noise (mismatches outside cut
region) might be higher than that of 1%, you have to adjust threshold
value to higher levels (min_freq input parameter). In case where you
expect index hopping to occour, you can use either
?amplican::amplicanPipelineConservative
or rise the
min_freq above expected index hopping level.
Reports are made for user convenience, they are powerful as they:
We decided to separate reports into 6 different types. Function
amplicanReport
is made for automated report creation.
Types of report:
We provide specialized plots for each type of the alignment events.
plot_mismatches
- plots mismatches as stacked bar-plot
over the amplicon, top plot is for the forward and bottom is for reverse
reads
plot_deletions
- plots deletions as arch-plot from the
ggbio package
plot_insertions
- each insertion is represented by triangle
over the amplicon
plot_cuts
- gathers cuts from each of the experiments and
overlays multiple arch-plots on top of each other, useful when analyzing
what kind of cuts were introduced in the amplicon
plot_variants
- presents most frequent mutations with
frameshift and codon information
plot_heterogeneity
- shows a measure of read
“uniqueness”
You can take a look at all these plots and how to make them in the
example reports. There are also meta versions of above plots, that
discard amplicon information and allow to overlay multiple different
amplicons on top of each other eg. metaplot_deletions
.