Title: | Accurate consensus sequence from nanopore reads of a gene library |
---|---|
Description: | Accurate consensus sequence from nanopore reads of a DNA gene library. SINGLe corrects for systematic errors in nanopore sequencing reads of gene libraries and it retrieves true consensus sequences of variants identified by a barcode, needing only a few reads per variant. More information in preprint doi: https://doi.org/10.1101/2020.03.25.007146. |
Authors: | Rocio Espada [aut, cre] |
Maintainer: | Rocio Espada <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.9.0 |
Built: | 2024-07-02 05:03:54 UTC |
Source: | https://github.com/bioc/single |
Vector ascii
ascii_v
ascii_v
Named character vector of length 94. Names are ascii character and value is the probability of error.
Evaluates SINGLe for pos, nucleotides and QUAL in the given ranges.
evaluate_fits( pos_range, q_range, output_file, data_fits, ref_seq, verbose = FALSE, save = FALSE )
evaluate_fits( pos_range, q_range, output_file, data_fits, ref_seq, verbose = FALSE, save = FALSE )
pos_range |
Numeric vector. Positions to evaluate. |
q_range |
Numeric vector. QUAL to evaluate. |
output_file |
File name for output, if save=TRUE. |
data_fits |
Data.frame with columns position nucleotide slope intercept as the one returned by fit_logregr |
ref_seq |
DNAStringSet containing the true reference sequence. |
verbose |
Logical. |
save |
Logical. Should results be saved in output_file? |
data.frame with SINGLE fits evaluated for pos_range and q_range.
refseq_fasta = system.file("extdata", "ref_seq.fasta", package = "single") ref_seq = Biostrings::readDNAStringSet(refseq_fasta) fits_file <- system.file("extdata","fits_example.txt",package = "single") fits <- read.table(fits_file, header=TRUE) evaluated_fits <- evaluate_fits(pos_range = c(1,5),q_range = c(0,10), data_fits = fits,ref_seq = ref_seq)
refseq_fasta = system.file("extdata", "ref_seq.fasta", package = "single") ref_seq = Biostrings::readDNAStringSet(refseq_fasta) fits_file <- system.file("extdata","fits_example.txt",package = "single") fits <- read.table(fits_file, header=TRUE) evaluated_fits <- evaluate_fits(pos_range = c(1,5),q_range = c(0,10), data_fits = fits,ref_seq = ref_seq)
This is an auxiliary function in single package. It takes counts_pnq and for each position and nucleotide it fits SINGLE's logistic regression.
fit_logregr( counts_pnq, ref_seq, p_prior_errors, p_prior_mutations, save = FALSE, output_file_fits, output_file_data, verbose = FALSE, keep_fit_quality = FALSE )
fit_logregr( counts_pnq, ref_seq, p_prior_errors, p_prior_mutations, save = FALSE, output_file_fits, output_file_data, verbose = FALSE, keep_fit_quality = FALSE )
counts_pnq |
Data frame with columns position nucleoide quality counts, as returned by pileup_by_QUAL |
ref_seq |
DNAStringSet containing the true reference sequence. |
p_prior_errors |
Data frame with columns position nucleotide prior.error, as the one returned by p_prior_errors(). |
p_prior_mutations |
Data frame with columns wt.base, nucleotide and p_mutation (probaility of mutation), as the one returned by p_prior_mutations(). |
save |
Logical. Should data be saved in a output_file? |
output_file_fits |
File into which save the single fits if save=TRUE |
output_file_data |
File into which save the fitted data if save=TRUE |
verbose |
Logical. |
keep_fit_quality |
Logical. Should parameters related to the quality of the fit be returned in extra columns of the output? |
data.frame with columns position, nucleotide, slope and intercept (of the sigmoidal regression).
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") ref_seq = Biostrings::readDNAStringSet(refseq_fasta) train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(bam_file=train_reads_example, pos_start=1,pos_end=10) p_prior_mutations <- p_prior_mutations(rates.matrix = mutation_rate, mean.n.mut = 5,ref_seq = ref_seq) p_prior_errors <- p_prior_errors(counts_pnq=counts_pnq) fits <- fit_logregr(counts_pnq = counts_pnq,ref_seq=ref_seq, p_prior_errors = p_prior_errors,p_prior_mutations = p_prior_mutations)
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") ref_seq = Biostrings::readDNAStringSet(refseq_fasta) train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(bam_file=train_reads_example, pos_start=1,pos_end=10) p_prior_mutations <- p_prior_mutations(rates.matrix = mutation_rate, mean.n.mut = 5,ref_seq = ref_seq) p_prior_errors <- p_prior_errors(counts_pnq=counts_pnq) fits <- fit_logregr(counts_pnq = counts_pnq,ref_seq=ref_seq, p_prior_errors = p_prior_errors,p_prior_mutations = p_prior_mutations)
This is an auxiliary function in single package. It evaluates the sigmoidal function given by the parameters slope and intercept on x.
glm.predict.(x, slope, intercept)
glm.predict.(x, slope, intercept)
x |
Numeric. Values to evaluate. |
slope |
Slope of the sigmoidal function to evaluate. |
intercept |
Intercept of the sigmoidal function to evaluate. |
Numeric.
x = c(-10:10) y = glm.predict.(x,1,2) plot(x,y)
x = c(-10:10) y = glm.predict.(x,1,2) plot(x,y)
This is an auxiliary function in single package, to list the mutations of two DNAstrings.
list_mismatches(ref, seq)
list_mismatches(ref, seq)
ref |
DNAString, reference sequence. |
seq |
DNAString, target sequence, same length as ref. |
Character vector containing Nucleotide in ref Position Nucleotide in seq. If ref and seq are equal, it returns NA.
ref = Biostrings::DNAString("AAAA") seq = Biostrings::DNAString("AGAT") list_mismatches(ref,seq) list_mismatches(ref,ref)
ref = Biostrings::DNAString("AAAA") seq = Biostrings::DNAString("AGAT") list_mismatches(ref,seq) list_mismatches(ref,ref)
Mutational rate matrix for error-prone PCR, obtained from GeneMorph II Random Mutagenesis Kit.
mutation_rate
mutation_rate
matrix size 4x5
https://www.agilent.com/cs/library/usermanuals/public/200550.pdf
This is an auxiliary function in single package. It takes a data frame with counts by position, nucleotide and Qscore and it summarises it into proportion of nucleotide counts by position.
p_prior_errors(counts_pnq, output_file = NULL, save = FALSE)
p_prior_errors(counts_pnq, output_file = NULL, save = FALSE)
counts_pnq |
Data frame with columns position nucleoide quality counts, as returned by parse_countspnq |
output_file |
File name for output, if save=TRUE. |
save |
Logical. Should data be saved in a output_file? |
Data frame with columns position nucleotide prior.error.
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(train_reads_example,pos_start=1,pos_end=10) p_prior_errors <- p_prior_errors(counts_pnq=counts_pnq) head(p_prior_errors)
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(train_reads_example,pos_start=1,pos_end=10) p_prior_errors <- p_prior_errors(counts_pnq=counts_pnq) head(p_prior_errors)
This is an auxiliary function in single package. It computes the prior probability of mutation in a gene library.
p_prior_mutations( rates.matrix, mean.n.mut, ref_seq, save = FALSE, output_file = "tablePriorMutations.txt" )
p_prior_mutations( rates.matrix, mean.n.mut, ref_seq, save = FALSE, output_file = "tablePriorMutations.txt" )
rates.matrix |
Mutation rate matrix: 4x5 matrix, each row/col representing a nucleotide (col adds deletion), and the values is the mutational rate from row to col. |
mean.n.mut |
Mean number of mutations expected (one number). |
ref_seq |
DNAStringSet containing the true reference sequence. |
save |
Logical. Should data be saved in a output_file? |
output_file |
File name for output, if save=TRUE. |
Data frame with columns wt.base (wild type nucleotide), nucleotide (mutated nucleotide), p_mutation (probaility of mutation)
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") ref_seq <- Biostrings::subseq(Biostrings::readDNAStringSet(refseq_fasta), 1,10) train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(train_reads_example,pos_start=1,pos_end=10) p_prior_mutations <- p_prior_mutations(rates.matrix = mutation_rate, mean.n.mut = 5,ref_seq = ref_seq) head(p_prior_mutations)
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") ref_seq <- Biostrings::subseq(Biostrings::readDNAStringSet(refseq_fasta), 1,10) train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(train_reads_example,pos_start=1,pos_end=10) p_prior_mutations <- p_prior_mutations(rates.matrix = mutation_rate, mean.n.mut = 5,ref_seq = ref_seq) head(p_prior_mutations)
To explain
pileup_by_QUAL( bam_file, QUAL_values = seq(93, 0), pos_start = NA, pos_end = NA )
pileup_by_QUAL( bam_file, QUAL_values = seq(93, 0), pos_start = NA, pos_end = NA )
bam_file |
Bam file to pile up |
QUAL_values |
Numeric vector. QUAL values to analyze in the data. |
pos_start |
Numeric. Position to start analyzing, counting starts from 1 and it refers to reference used for minimap2 alignment. |
pos_end |
Numeric. Position to stop analyzing, counting starts from 1 and it refers to reference used for minimap2 alignment. |
data.frame with columns strand,pos,nucleotide,QUAL,countss
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(bam_file=train_reads_example, pos_start=1,pos_end=10) head(counts_pnq)
refseq_fasta <- system.file("extdata", "ref_seq.fasta", package = "single") train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") counts_pnq <- pileup_by_QUAL(bam_file=train_reads_example, pos_start=1,pos_end=10) head(counts_pnq)
Main function to compute consensus after correcting reads by a SINGLE model.
single_consensus_byBarcode( barcodes_table, sequences, readID_col = 1, bcID_col = 2, header = TRUE, dec = ".", sep = " ", verbose = TRUE )
single_consensus_byBarcode( barcodes_table, sequences, readID_col = 1, bcID_col = 2, header = TRUE, dec = ".", sep = " ", verbose = TRUE )
barcodes_table |
data.frame or file name containing the names of the reads and the barcode associated (or any grouping tag). |
sequences |
QualityScaledDNAStringSet or fastq file name. Contains sequences from which compute weighted consensus. |
readID_col , bcID_col
|
Numeric. Columns where the reads id and the barcode (or grouping tag) are, in the barcodes_table |
header , dec , sep
|
Arguments for read.table(barcodes_table) |
verbose |
Logical. |
DNAStringSet with consensus sequences
pos_start=1 pos_end = 100 barcodes_file = system.file("extdata", "Barcodes_table.txt",package = "single") reads_single = system.file("extdata", "corrected_seqs.fastq", package = "single") single_consensus_byBarcode(barcodes_file,reads_single, verbose = FALSE)
pos_start=1 pos_end = 100 barcodes_file = system.file("extdata", "Barcodes_table.txt",package = "single") reads_single = system.file("extdata", "corrected_seqs.fastq", package = "single") single_consensus_byBarcode(barcodes_file,reads_single, verbose = FALSE)
Main function to evaluate a gene library using a SINGLE model.
single_evaluate( bamfile, single_fits, refseq_fasta, pos_start = NULL, pos_end = NULL, gaps_weights, save = FALSE, output_file, verbose = FALSE, save_original_scores = FALSE )
single_evaluate( bamfile, single_fits, refseq_fasta, pos_start = NULL, pos_end = NULL, gaps_weights, save = FALSE, output_file, verbose = FALSE, save_original_scores = FALSE )
bamfile |
File containing the counts per position returned by samtools mpileup |
single_fits |
Results of the SINGLE model as returned by single_train(). It can be either the output data.frame or the saved file. |
refseq_fasta |
Fasta file containing reference sequence |
pos_start |
Numeric. Position to start analyzing, counting starts from 1 and it refers to reference used for minimap2 alignment. |
pos_end |
Numeric. Position to stop analyzing, counting starts from 1 and it refers to reference used for minimap2 alignment. |
gaps_weights |
One of "minimum","none","mean". How to assign qscores to deletions. |
save |
Logical. Should data be saved in a output_file? |
output_file |
File name for output, if save=TRUE. |
verbose |
Logical |
save_original_scores |
Logical. Should original Qscores be saved? If TRUE, they are stored in file whose name finishes in _original.fastq |
Before running single_evaluate_function you have to align your INPUT data to a REFERENCE using minimap2 and count the nucleotides per position using samtools using these lines:
minimap2 -ax map-ont --sam-hit-only REFERENCE.fasta INPUT.fastq >ALIGNMENT.sam
samtools view -S -b ALIGNMENT.sam > ALIGNMENT.bam
samtools sort ALIGNMENT.bam -o ALIGNMENT.sorted.bam
samtools mpileup -Q 0 ALIGNMENT.sorted.bam > COUNTS.txt
Creates file output_prefix_corrected.txt with the Qscores re-scaled by SINGLE. Columns are SeqID position nucleotide isWT original_quality p_SINGLe
refseq_fasta = system.file("extdata", "ref_seq_10bases.fasta", package = "single") train_file <- system.file("extdata", "train_example.txt", package = "single") train <- read.table(train_file, header=TRUE) test_reads_example <- system.file("extdata", "test_sequences.sorted.bam", package = "single") corrected_reads <- single_evaluate(bamfile = test_reads_example, single_fits = train,refseq_fasta = refseq_fasta, pos_start=1,pos_end=10,gaps_weights = "minimum") corrected_reads
refseq_fasta = system.file("extdata", "ref_seq_10bases.fasta", package = "single") train_file <- system.file("extdata", "train_example.txt", package = "single") train <- read.table(train_file, header=TRUE) test_reads_example <- system.file("extdata", "test_sequences.sorted.bam", package = "single") corrected_reads <- single_evaluate(bamfile = test_reads_example, single_fits = train,refseq_fasta = refseq_fasta, pos_start=1,pos_end=10,gaps_weights = "minimum") corrected_reads
Main function to train a SINGLE model in a set of reads of a reference / wild type sequence. To get the input data you will need to run before a minimap2 alignment and samtools counts.
single_train( bamfile, output = "results", refseq_fasta, rates.matrix = NULL, mean.n.mutations = NULL, pos_start = NULL, pos_end = NULL, verbose = TRUE, save_partial = FALSE, save_final = FALSE )
single_train( bamfile, output = "results", refseq_fasta, rates.matrix = NULL, mean.n.mutations = NULL, pos_start = NULL, pos_end = NULL, verbose = TRUE, save_partial = FALSE, save_final = FALSE )
bamfile |
File containing the counts per position returned by samtools mpileup |
output |
String. Prefix for output files |
refseq_fasta |
Fasta file containing reference sequence |
rates.matrix |
Mutation rate matrix: 4x5 matrix, each row/col representing a nucleotide (col adds deletion), and the values is the mutational rate from row to col. |
mean.n.mutations |
Mean number of mutations expected (one number). |
pos_start |
Numeric. Position to start analyzing, counting starts from 1 and it refers to reference used for minimap2 alignment. |
pos_end |
Numeric. Position to stop analyzing, counting starts from 1 and it refers to reference used for minimap2 alignment. |
verbose |
Logical. |
save_partial |
Logical. Should partial results be saved in files? |
save_final |
Logical. Should final fits be saved in a file? |
Before running single_train_function you have to align your INPUT data to a REFERENCE using minimap2 and count the nucleotides per position using samtools using these lines:
minimap2 -ax map-ont --sam-hit-only REFERENCE.fasta INPUT.fastq >ALIGNMENT.sam
samtools view -S -b ALIGNMENT.sam > ALIGNMENT.bam
samtools sort ALIGNMENT.bam -o ALIGNMENT.sorted.bam
samtools mpileup -Q 0 ALIGNMENT.sorted.bam > COUNTS.txt
Creates file output_prefix_single_results.txt with SINGLE training results.
refseq_fasta<- system.file("extdata", "ref_seq.fasta", package = "single") train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") train <- single_train(bamfile=train_reads_example, refseq_fasta=refseq_fasta, rates.matrix=mutation_rate,mean.n.mutations=5.4, pos_start=1,pos_end=10) print(head(train))
refseq_fasta<- system.file("extdata", "ref_seq.fasta", package = "single") train_reads_example <- system.file("extdata", "train_seqs_500.sorted.bam", package = "single") train <- single_train(bamfile=train_reads_example, refseq_fasta=refseq_fasta, rates.matrix=mutation_rate,mean.n.mutations=5.4, pos_start=1,pos_end=10) print(head(train))
This is an auxiliary function in single package. It computes consensus from a data.frame as the one returned by single_evaluate()
weighted_consensus(df, cutoff_prob = 0.2)
weighted_consensus(df, cutoff_prob = 0.2)
df |
data.frame with the columns: nucleotide, probability, position |
cutoff_prob |
Numeric. Nucleotides with probability below this number will be removed from consensus computation. |
Character vector, consensus sequence
fastq_seqs_example <- system.file("extdata", "test_sequences.fastq",package = "single") seqs_example <- Biostrings::readQualityScaledDNAStringSet(fastq_seqs_example) # Using single weights data_barcode = data.frame( nucleotide = unlist(sapply(as.character(seqs_example),strsplit, split="")), p_SINGLe=unlist(1-as(Biostrings::quality(seqs_example),"NumericList")), pos=rep(1:Biostrings::width(seqs_example[1]),length(seqs_example))) weighted_consensus(df = data_barcode, cutoff_prob = 0.9) # Replacing weights by ones data_barcode = data.frame( nucleotide = unlist(sapply(as.character(seqs_example),strsplit, split="")), p_SINGLe=1,pos=rep(1,sum(Biostrings::width(seqs_example)))) weighted_consensus(df = data_barcode, cutoff_prob = 0)
fastq_seqs_example <- system.file("extdata", "test_sequences.fastq",package = "single") seqs_example <- Biostrings::readQualityScaledDNAStringSet(fastq_seqs_example) # Using single weights data_barcode = data.frame( nucleotide = unlist(sapply(as.character(seqs_example),strsplit, split="")), p_SINGLe=unlist(1-as(Biostrings::quality(seqs_example),"NumericList")), pos=rep(1:Biostrings::width(seqs_example[1]),length(seqs_example))) weighted_consensus(df = data_barcode, cutoff_prob = 0.9) # Replacing weights by ones data_barcode = data.frame( nucleotide = unlist(sapply(as.character(seqs_example),strsplit, split="")), p_SINGLe=1,pos=rep(1,sum(Biostrings::width(seqs_example)))) weighted_consensus(df = data_barcode, cutoff_prob = 0)