Package 'polyester' reference manual

Title:	Simulate RNA-seq reads
Description:	This package can be used to simulate RNA-seq reads from differential expression experiments with replicates. The reads can then be aligned and used to perform comparisons of methods for differential expression.
Authors:	Alyssa C. Frazee, Andrew E. Jaffe, Rory Kirchner, Jeffrey T. Leek
Maintainer:	Jack Fu <[email protected]>, Jeff Leek <[email protected]>
License:	Artistic-2.0
Version:	1.41.0
Built:	2024-07-02 03:18:08 UTC
Source:	https://github.com/bioc/polyester

add sequencing error to simulated reads

Description

simulate sequencing error by randomly changing the sequenced nucleotide on some of the reads

Usage

add_error(tFrags, error_rate = 0.005)
add_error(tFrags, error_rate = 0.005)

Arguments

`tFrags`	DNAStringSet representing sequencing reads
`error_rate`	error probability

Value

DNAStringSet equivalent to tFrags but with random sequencing errors inserted

Examples

  library(Biostrings)
  data(srPhiX174)
  set.seed(174)
  srPhiX174_withError = add_error(srPhiX174)
  #error was introduced in, e.g., position 10 of 2nd string in set.
library(Biostrings)
  data(srPhiX174)
  set.seed(174)
  srPhiX174_withError = add_error(srPhiX174)
  #error was introduced in, e.g., position 10 of 2nd string in set.

add GC bias to a count matrix

Description

Given a matrix with rows corresponding to transcripts and sample-specific GC bias models, bias the count matrix using the bias model.

Usage

add_gc_bias(readmat, gcbias, transcripts)
add_gc_bias(readmat, gcbias, transcripts)

Arguments

`readmat`	matrix of counts, with rows corresponding to features (transcripts) and columns corresponding to replicates
`gcbias`	List of GC bias models to add to readmat. Must have length equal to the number of columns of `readmat`. List elements must either be integers 0 through 7, where 0 means no bias and 1-7 correspond to built-in GC bias models, or objects of class `loess` which can predict a deviation from overall mean count (on the log scale) given a GC percentage between 0 and 1.
`transcripts`	`DNAStringSet` object containing the sequences of the features (transcripts) corresponding to the rows of `readmat`. Length must be equal to the number of rows in `readmat`.

Details

Designed for internal use in simulate_experiment functions.

Value

matrix of the same size as readmat, but with counts for each replicate biased according to gcbias.

Examples

  library(Biostrings)
  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  transcripts = readDNAStringSet(fastapath)

  # create a count matrix:
  readmat = matrix(20, ncol=10, nrow=numtx)
  readmat[1:30, 1:5] = 40

  # add biases randomly: use built-in bias models
  set.seed(137)
  biases = sample(0:7, 10, replace=TRUE)
  readmat_biased = add_gc_bias(readmat, as.list(biases), transcripts)
library(Biostrings)
  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  transcripts = readDNAStringSet(fastapath)

  # create a count matrix:
  readmat = matrix(20, ncol=10, nrow=numtx)
  readmat[1:30, 1:5] = 40

  # add biases randomly: use built-in bias models
  set.seed(137)
  biases = sample(0:7, 10, replace=TRUE)
  readmat_biased = add_gc_bias(readmat, as.list(biases), transcripts)

Simulate sequencing error using empirical error model

Description

Given a sequencing platform and a set of sequencing reads, add sequencing errors to the reads given a known error profile from the platform.

Usage

add_platform_error(tFrags, platform, paired, path = NULL)
add_platform_error(tFrags, platform, paired, path = NULL)

Arguments

`tFrags`	DNAStringSetList containing error-free sequencing reads. If simulating a paired-end experiment, mate-pairs should appear next to each other in tFrags.
`platform`	Which sequencing platform should the error model be estimated from? Currently supports `'illumina4'`, `'illumina5'`, `'roche454'`, and `'custom'`.
`paired`	Does `tFrags` contain paired end reads, with mate pairs next to each other? (TRUE if yes.)
`path`	if platform is `'custom'`, provide the path to the error model. After processing the error model with `build_error_models.py`, you will have either two files (ending in _mate1 and _mate2, if your model was for paired-end reads) or one file (ending in _single, if your model was for single-end reads). The `path` argument should be the path to the error model up to but not including _mate1/_mate2/_single.

Details

This function adds sequencing error to a set of reads based on the position in the read and the true nucleotide at that location. Position-specific probabilities of making each possible sequencing error (reading a T when it should have been A, reading a G when it should have been T, etc.) were calculated for each of three platforms using the empirical error models available with the GemSIM software (see references). Users can also estimate an error model from their own data using GemSIM and can use that error model with Polyester as described in the vignette. (You will need to run a Python script available at the Polyester GitHub repository to process the error model).

Value

DNAStringSet object that is the same as tFrags except but with sequencing error added.

References

McElroy KE, Luciani F and Thomas T (2012): GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Examples

  library(Biostrings)
  # pretend the srPhiX174 DNAStringSet represents 35bp single-end 
  # sequencing reads:
  data(srPhiX174) 
  set.seed(718)
  data_with_errors = add_platform_error(srPhiX174, 'illumina4', paired=FALSE)
  
  # the 17th read in this set has an error at position 20:
  data_with_errors[17][[1]][20] # N
  srPhiX174[17][[1]][20] # T
  
  # 101 reads total have at least one sequencing error:
  sum(data_with_errors != srPhiX174)
 
library(Biostrings)
  # pretend the srPhiX174 DNAStringSet represents 35bp single-end 
  # sequencing reads:
  data(srPhiX174) 
  set.seed(718)
  data_with_errors = add_platform_error(srPhiX174, 'illumina4', paired=FALSE)
  
  # the 17th read in this set has an error at position 20:
  data_with_errors[17][[1]][20] # N
  srPhiX174[17][[1]][20] # T
  
  # 101 reads total have at least one sequencing error:
  sum(data_with_errors != srPhiX174)

Model of positional bias that can arise when RNA-seq is performed using protocols relying on cDNA fragmentation.

Description

This positional bias model was estimated in Li and Jiang (2012). With cDNA fragmentation, reads are more likely to have come from the 3' end of the transcript. The probabilities included in this dataset were estimated from Supplementary Figure S3 in Li and Jiang's manuscript. Data points from the figure were inferred and exported as CSV files using WebPlotDigitizer. The CSV files and the code used to process them and create the datasets are available in the Polyester GitHub repository (https://github.com/alyssafrazee/polyester).

Format

data frame with 100 rows and 2 columns. Column 1 is position along a transcript (in percent), while Column 2 is the probability of getting a fragment at that position. Column 2 sums to 1.

References

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

Rohatgi A (2014): WebPlotDigitizer: Version 3.4 of WebPlotDigitizer. ZENODO. 10.5281/zenodo.11835

determine how many transcripts are annotated in a FASTA or GTF file

Description

determine how many transcripts are annotated in a FASTA or GTF file

Usage

count_transcripts(
  f,
  fasta = TRUE,
  identifier = "transcript_id",
  attrsep = "; "
)
count_transcripts(
  f,
  fasta = TRUE,
  identifier = "transcript_id",
  attrsep = "; "
)

Arguments

`f`	character, path to a file in FASTA or GTF format
`fasta`	TRUE if `f` is a fasta file; FALSE if `f` is a GTF file
`identifier`	if `f` is a GTF file, how are transcripts identified in the attributes field (9th column) of the file? Default `transcript_id`.
`attrsep`	if `f` is a GTF file, how are attributes separated in the attributes field (9th column) of the file? Default "; ".

Value

Number of transcripts annotated in f

Examples

fastapath = system.file("extdata", "chr22.fa", package="polyester")
count_transcripts(fastapath) #918

fastapath = system.file("extdata", "chr22.fa", package="polyester")
count_transcripts(fastapath) #918

Generate a simulated data set based on known model parameters

Description

Generate a simulated data set based on known model parameters

Usage

create_read_numbers(
  mu,
  fit,
  p0,
  m = NULL,
  n = NULL,
  mod = NULL,
  beta = NULL,
  seed = NULL
)
create_read_numbers(
  mu,
  fit,
  p0,
  m = NULL,
  n = NULL,
  mod = NULL,
  beta = NULL,
  seed = NULL
)

Arguments

`mu`	Baseline mean expression for negative binomial model
`fit`	Fitted relationship between log mean and log size
`p0`	A vector of the probabilities a count is zero
`m`	Number of genes/transcripts to simulate (not necessary if mod, beta are specified)
`n`	Number of samples to simulate (not necessary if mod, beta are specified)
`mod`	Model matrix you would like to simulate from without an intercept
`beta`	set of coefficients for the model matrix (must have same number of columns as mod)
`seed`	optional seed to set (for reproducibility)

Value

counts Data matrix with counts for genes in rows and samples in columns

Author(s)

Jeff Leek

Examples

  library(ballgown)
  data(bg)
  countmat = fpkm_to_counts(bg, mean_rps=400000)
  params = get_params(countmat)
  Ntranscripts = 50
  Nsamples = 10
  custom_readmat = create_read_numbers(mu=params$mu, fit=params$fit,
    p0=params$p0, m=Ntranscripts, n=Nsamples, seed=103)

library(ballgown)
  data(bg)
  countmat = fpkm_to_counts(bg, mean_rps=400000)
  params = get_params(countmat)
  Ntranscripts = 50
  Nsamples = 10
  custom_readmat = create_read_numbers(mu=params$mu, fit=params$fit,
    p0=params$p0, m=Ntranscripts, n=Nsamples, seed=103)

Estimated distribution of fragment lengths

Description

Empirical fragment length distribution was estimated using 7 randomly selected RNA-seq samples from the GEUVADIS dataset ('t Hoen et al 2013). One sample was selected from each of the 7 laboratories that performed the sequencing. We used Picard's "CollectInsertSizeMetrics" tool (http://broadinstitude.github.io/picard/), version 1.121, to estimate the fragment size distribution based on read alignments. Code we used to estimate this distribution is available at https://github.com/alyssafrazee/polyester/blob/master/make_fraglen_model.R.

Format

logspline object (created with logspline) specifying the empirical density of fragment lengths in the 7 GEUVADIS samples.

References

't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Turn FPKMs from a ballgown object into estimated counts for transcripts

Description

Turn FPKMs from a ballgown object into estimated counts for transcripts

Usage

fpkm_to_counts(
  bg = NULL,
  mat = NULL,
  tlengths = NULL,
  mean_rps = 1e+08,
  threshold = 0
)
fpkm_to_counts(
  bg = NULL,
  mat = NULL,
  tlengths = NULL,
  mean_rps = 1e+08,
  threshold = 0
)

Arguments

`bg`	ballgown object created from real RNA-seq dataset
`mat`	matrix of isoform-level FPKMs from which to derive counts. Rows should represent transcripts and columns should represent counts. Provide exactly one of `bg` or `mat`.
`tlengths`	if using `mat` instead of `bg`, vector of transcript lengths. Entries correspond to the rows of `mat`. Lengths should only count the nucleotides within transcripts' exons.
`mean_rps`	This should be the number of reads per sample in total for use in backing out the FPKM calculations.
`threshold`	only estimate parameters from transcripts with mean FPKM measurements at least as large as `threshold`.

Details

If transcripts/exons are represented by GRanges or GRangesList objects, the width function is really useful in calculating transcript lengths.

Value

A matrix of counts with the same number of rows and columns as the ballgown object

Author(s)

Jeff Leek

Examples

  library(ballgown)
  data(bg)
  countmat = fpkm_to_counts(bg, mean_rps=400000)

library(ballgown)
  data(bg)
  countmat = fpkm_to_counts(bg, mean_rps=400000)

generate a set of fragments from a set of transcripts

Description

Convert each sequence in a DNAStringSet to a "fragment" (subsequence)

Usage

generate_fragments(
  tObj,
  fraglen = 250,
  fragsd = 25,
  readlen = 100,
  distr = "normal",
  custdens = NULL,
  bias = "none",
  frag_GC_bias = "none"
)
generate_fragments(
  tObj,
  fraglen = 250,
  fragsd = 25,
  readlen = 100,
  distr = "normal",
  custdens = NULL,
  bias = "none",
  frag_GC_bias = "none"
)

Arguments

`tObj`	DNAStringSet of sequences from which fragments should be extracted
`fraglen`	Mean fragment length, if drawing fragment lengths from a normal distribution.
`fragsd`	Standard deviation of fragment lengths, if drawing lengths from a normal distribution. Note: `fraglen` and `fragsd` are ignored unless `distr` is 'normal'.
`readlen`	Read length. Default 100. Used only to label read positions.
`distr`	One of 'normal', 'empirical', or 'custom'. If 'normal', draw fragment lengths from a normal distribution with mean `fraglen` and standard deviation `fragsd`. If 'empirical', draw fragment lengths from a fragment length distribution estimated from a real data set. If 'custom', draw fragment lengths from a custom distribution, provided as the `custdens` argument, which should be a density fitted using `logspline`.
`custdens`	If `distr` is 'custom', draw fragments from this density. Should be an object of class `logspline`.
`bias`	One of 'none', 'rnaf', or 'cdnaf' (default 'none'). 'none' represents uniform fragment selection (every possible fragment in a transcript has equal probability of being in the experiment); 'rnaf' represents positional bias that arises in protocols using RNA fragmentation, and 'cdnaf' represents positional bias arising in protocols that use cDNA fragmentation (Li and Jiang 2012). Using the 'rnaf' model, coverage is higher in the middle of the transcript and lower at both ends, and in the 'cdnaf' model, coverage increases toward the 3' end of the transcript. The probability models used come from Supplementary Figure S3 of Li and Jiang (2012).
`frag_GC_bias`	See explanation in `simulate_experiment`.

Details

The empirical fragment length distribution was estimated using 7 randomly selected RNA-seq samples from the GEUVADIS dataset ('t Hoen et al 2013), one sample from each laboratory that performed sequencing for that data set. We used Picard's "CollectInsertSizeMetrics" (http://broadinstitute.github.io/picard/), version 1.121, to estimate the insert size distribution based on the read alignments.

Value

DNAStringSet consisting of one randomly selected subsequence per element of tObj.

References

't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

Examples

  library(Biostrings)
  data(srPhiX174)

  ## get fragments with lengths drawn from normal distrubution
  set.seed(174)
  srPhiX174_fragments = generate_fragments(srPhiX174, fraglen=15, fragsd=3,
      readlen=4)
  srPhiX174_fragments
  srPhiX174

  ## get fragments with lengths drawn from an empirical distribution
  empirical_frags = generate_fragments(srPhiX174, distr='empirical')
  empirical_frags

  ## get fragments with lengths from a normal distribution, but include
  ## positional bias from cDNA fragmentation:
  biased_frags = generate_fragments(srPhiX174, bias='cdnaf')
  biased_frags

library(Biostrings)
  data(srPhiX174)

  ## get fragments with lengths drawn from normal distrubution
  set.seed(174)
  srPhiX174_fragments = generate_fragments(srPhiX174, fraglen=15, fragsd=3,
      readlen=4)
  srPhiX174_fragments
  srPhiX174

  ## get fragments with lengths drawn from an empirical distribution
  empirical_frags = generate_fragments(srPhiX174, distr='empirical')
  empirical_frags

  ## get fragments with lengths from a normal distribution, but include
  ## positional bias from cDNA fragmentation:
  biased_frags = generate_fragments(srPhiX174, bias='cdnaf')
  biased_frags

Estimate zero-inflated negative binomial parameters from a real dataset

Description

This function estimates the parameters of a zero inflated negative binomial distribution based on a real count data set based on the method of moments. The function also returns a spline fit of log mean to log size which can be used when generating new simulated data.

Usage

get_params(counts, threshold = NULL)
get_params(counts, threshold = NULL)

Arguments

`counts`	A matrix of counts. If you want to simulate from a ballgown object, see `fpkm_to_counts`
`threshold`	Only estimate parameters from transcripts with row means greater than threshold

Value

p0 A vector of probabilities that the count will be zero, one for each gene/transcript.

mu The estimated negative binomial mean by method of moments for the non-zero counts

size The estimated negative binomial size by method of moments for the non-zero counts

fit A fit relating log mean to log size for use in simulating new data.

Author(s)

Jeff Leek

Examples

  library(ballgown)
  data(bg)
  countmat = fpkm_to_counts(bg, mean_rps=400000)
  params = get_params(countmat)

library(ballgown)
  data(bg)
  countmat = fpkm_to_counts(bg, mean_rps=400000)
  params = get_params(countmat)

get sequencing reads from fragments

Description

simulate the sequencing process by returning the sequence of one or both ends of provided fragments

Usage

get_reads(tFrags, readlen, paired = TRUE)
get_reads(tFrags, readlen, paired = TRUE)

Arguments

`tFrags`	DNAStringSet representing fragments
`readlen`	Read length.
`paired`	If `FALSE`, return only the first `readlen` bases of each element of `tFrags` in the result; if `TRUE`, also return last `readlen` bases.

Value

DNAStringSet representing simulated RNA-seq reads

Examples

  library(Biostrings)
  data(srPhiX174)
  set.seed(174)
  srPhiX174_reads = get_reads(srPhiX174, readlen=15, paired=FALSE)
  srPhiX174_reads  
  # set of single-end, 15bp reads, treating srPhiX174 as the fragments
library(Biostrings)
  data(srPhiX174)
  set.seed(174)
  srPhiX174_reads = get_reads(srPhiX174, readlen=15, paired=FALSE)
  srPhiX174_reads  
  # set of single-end, 15bp reads, treating srPhiX174 as the fragments

extract a specific field of the "attributes" column of a data frame created from a GTF/GFF file

Description

extract a specific field of the "attributes" column of a data frame created from a GTF/GFF file

Usage

getAttributeField(x, field, attrsep = "; ")
getAttributeField(x, field, attrsep = "; ")

Arguments

`x`	vector representing the "attributes" column of GTF/GFF file
`field`	name of the field you want to extract from the "attributes" column
`attrsep`	separator for the fields in the attributes column. Defaults to '; ', the separator for GTF files outputted by Cufflinks.

Value

vector of nucleotide positions included in the transcript

Author(s)

Wolfgang Huber, in the davidTiling package (LGPL license)

Examples

  library(ballgown)
  gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown')
  gffdata = gffRead(gtfPath)
  gffdata$transcriptID = getAttributeField(gffdata$attributes, 
    field = "transcript_id")

library(ballgown)
  gtfPath = system.file('extdata', 'annot.gtf.gz', package='ballgown')
  gffdata = gffRead(gtfPath)
  gffdata$transcriptID = getAttributeField(gffdata$attributes, 
    field = "transcript_id")

data frame (in gtf-inspired format) for chromosome 22, hg19

Description

In the data frame gtf_dataframe, each row corresponds to an exon / coding sequence / start codon / stop codon, and the columns correspond to standard GTF columns denoting annotated genomic features. See http://www.ensembl.org/info/website/upload/gff.html.

Format

data frame, 9 columns, 17769 rows

Source

Illumina iGenomes, hg19, 6 March 2013 version: http://ccb.jhu.edu/software/tophat/igenomes.shtml.

Empirical GC bias model, NA06985

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA06985 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical GC bias model, NA12144

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA12144 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical GC bias model, NA12776

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA12776 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical GC bias model, NA18858

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA18858 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical GC bias model, NA20542

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA20542 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical GC bias model, NA20772

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA20772 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical GC bias model, NA20815

Description

Loess model for log counts measuring transcript expression as a function of the transcript's GC content. The model was created using sample NA20815 in the Ballgown obtained at http://files.figshare.com/1625419/fpkm.rda

Format

Object of class loess

Source

Constructed using the code available at https://github.com/alyssafrazee/polyester/blob/master/gc_bias.R

References

GEUVADIS data set: 't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Lappalainen, et al (2013): Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506-511.

Empirical error model for Illumina Genome Analyzer IIx with Illumina Sequencing Kit v4 chemistry, read mate 1 of a pair

Description

for each position in mate 1 of a paired-end read generated with the specified Illumina chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is A, the probability of correctly calling that base an A is 0.9998, the probability of making an error by sequencing a T is 2.64e-05, the probability of making an error by sequencing a G is 1.58e-04, the probability of making an error by sequencing a C is 3.05e-05, and the probability of reading an 'N' at position 8 is 0. This can be seen by looking at model1[model1$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model1, 7 columns, 505 rows

Source

processed from the Illumina v4 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Empirical error model for Illumina Genome Analyzer IIx with Illumina Sequencing Kit v4 chemistry, read mate 2 of a pair

Description

for each position in mate 2 of a paired-end read generated with the specified Illumina chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is A, the probability of correctly calling that base an A is 0.9995, the probability of making an error by sequencing a T is 0.00017, the probability of making an error by sequencing a G is 0.00023, the probability of making an error by sequencing a C is 6.02e-05, and the probability of reading an 'N' at position 8 is 1.15e-05. This can be seen by looking at model2[model2$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model2, 7 columns, 505 rows

Source

processed from the Illumina v4 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Empirical error model for Illumina Genome Analyzer IIx with Illumina Sequencing Kit v4 chemistry, single-end read

Description

for each position in a single-end read generated with the specified Illumina chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is A, the probability of correctly calling that base an A is 0.9998, the probability of making an error by sequencing a T is 2.95e-05, the probability of making an error by sequencing a G is 1.27e-04, the probability of making an error by sequencing a C is 1.85e-05, and the probability of reading an 'N' at position 8 is 0. This can be seen by looking at model3[model3$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model3, 7 columns, 505 rows

Source

processed from the Illumina v4 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Empirical error model for Illumina Genome Analyzer IIx with TrueSeq SBS Kit v5-GA chemistry, read mate 1 of a pair

Description

for each position in mate 1 of a paired-end read generated with the specified Illumina chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is A, the probability of correctly calling that base an A is 0.9998, the probability of making an error by sequencing a T is 4.00e-05, the probability of making an error by sequencing a G is 1.58e-04, the probability of making an error by sequencing a C is 1.46e-05, and the probability of reading an 'N' at position 8 is 0. This can be seen by looking at model4[model4$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model4, 7 columns, 505 rows

Source

processed from the Illumina v5 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Empirical error model for Illumina Genome Analyzer IIx with TrueSeq SBS Kit v5-GA chemistry, read mate 2 of a pair

Description

for each position in mate 2 of a paired-end read generated with the specified Illumina chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is A, the probability of correctly calling that base an A is 0.9992, the probability of making an error by sequencing a T is 0.0002, the probability of making an error by sequencing a G is 0.0002, the probability of making an error by sequencing a C is 0.0001, and the probability of reading an 'N' at position 8 is 0.0002. This can be seen by looking at model5[model5$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model5, 7 columns, 505 rows

Source

processed from the Illumina v5 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Empirical error model for Illumina Genome Analyzer IIx with TrueSeq SBS Kit v5-GA chemistry, single-end read

Description

for each position in a single-end read generated with the specified Illumina chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is A, the probability of correctly calling that base an A is 0.9998, the probability of making an error by sequencing a T is 3.04e-05, the probability of making an error by sequencing a G is 1.36e-04, the probability of making an error by sequencing a C is 1.27e-05, and the probability of reading an 'N' at position 8 is 0. This can be seen by looking at model6[model6$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model6, 7 columns, 505 rows

Source

processed from the Illumina v5 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Empirical error model Roche/454 FLX Titanium, single-end read

Description

for each position in a single-end read generated with the specified chemistry, this data frame contains the probability of not making a sequencing error, and of making each of the 4 possible types of sequencing errors. The reference base (truth) is in column 1, and the probabilities of sequencing that base given its read position (column 7) as each of the 5 possible bases (A, T, G, C, and N) is given in columns 2 through 6, respectively. So for example, at position 8 in mate 1 of a read where the true base is C, the probability of correctly calling that base a C is 0.9994, the probability of making an error by sequencing a T is 0.0002, the probability of making an error by sequencing a G is 0.0001, the probability of making an error by sequencing an A is 0.0002, and the probability of reading an 'N' at position 8 is 0. This can be seen by looking at model7[model7$pos == 8,]. Note that position indexing is 1-based, though a 0 position is included as described in the GemSIM documentation.

Format

data frame named model7, 7 columns, 505 rows

Source

processed from the Roche 454 error model that ships with GemSIM (see references)

References

McElroy KE, Luciani F, Thomas T (2012). GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Draw nonzero negative binomial random numbers

Description

Draw nonzero negative binomial random numbers

Usage

NB(basemeans, size, seed = NULL)
NB(basemeans, size, seed = NULL)

Arguments

`basemeans`	vector of means, one per draw
`size`	vector of size parameters (controlling the mean/variance relationship); one per draw
`seed`	optional seed to set before drawing

Value

vector of negative binomial draws from specified distributions, where any zero draw is replaced with a 1. Length of return vector is equal to length(basemeans).

Examples

  randomNBs = NB(c(100, 4, 29), size=c(50, 2, 4), seed=21)
  randomNBs  # 115, 5, 15

randomNBs = NB(c(100, 4, 29), size=c(50, 2, 4), seed=21)
  randomNBs  # 115, 5, 15

Polyester: simulating RNA-seq reads including differential expression

Description

Polyester is an R package designed to simulate an RNA sequencing experiment. Given a set of annotated transcripts, polyester will simulate the steps of an RNA-seq experiment (fragmentation, reverse-complementing, and sequencing) and produce files containing simulated RNA-seq reads. Simulated reads can be analyzed using any of several downstream analysis tools.

Details

A single function call produces RNA-seq reads in FASTA format from a case/control experiment including biological replicates. Differential expression between cases and controls can be set by the user, facilitating comparisons of statistical differential expression methods for RNA-seq data. See detailed documentation for simulate_experiment and simulate_experiment_countmat.

See the vignette by typing browseVignettes("polyester") in the R prompt.

Author(s)

Alyssa Frazee, Andrew Jaffe, Rory Kirchner, Jeff Leek

References

Alyssa C Frazee, Geo Pertea, Andrew E Jaffe, Ben Langmead, Steven L Salzberg, Jeffrey T Leek (2014). Flexible isoform-level differential expression analysis with Ballgown. BioRxiv preprint: http://biorxiv.org/content/early/2014/03/30/003665.

reverse-complement some fragments

Description

randomly reverse-complement half of the sequences in a DNAStringSet

Usage

reverse_complement(tObj, seed = NULL)
reverse_complement(tObj, seed = NULL)

Arguments

`tObj`	DNAStringSet representing sequences.
`seed`	optional seed to set before randomly selecting the sequences to be reverse-complemented.

Value

DNAStringSet that is the same as tObj, but with about half the sequences reverse-complemented.

Examples

  library(Biostrings)
  data(srPhiX174)
  srPhiX174_halfrc = reverse_complement(srPhiX174, seed=174)
library(Biostrings)
  data(srPhiX174)
  srPhiX174_halfrc = reverse_complement(srPhiX174, seed=174)

Model of positional bias that can arise when RNA-seq is performed using protocols relying on RNA fragmentation.

Description

This positional bias model was estimated in Li and Jiang (2012). With RNA fragmentation, reads are more likely to have come from the middle of the transcript than either end. The probabilities included in this dataset were estimated from Supplementary Figure S3 in Li and Jiang's manuscript. Data points from the figure were inferred and exported as CSV files using WebPlotDigitizer. The CSV files and the code used to process them and create the datasets are available in the Polyester GitHub repository (https://github.com/alyssafrazee/polyester).

Format

data frame with 100 rows and 2 columns. Column 1 is position along a transcript (in percent), while Column 2 is the probability of getting a fragment at that position. Column 2 sums to 1.

References

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

Rohatgi A (2014): WebPlotDigitizer: Version 3.4 of WebPlotDigitizer. ZENODO. 10.5281/zenodo.11835

Get transcript sequences from GTF file and sequence info

Description

Given a GTF file (for transcript structure) and DNA sequences, return a DNAStringSet of transcript sequences

Usage

seq_gtf(
  gtf,
  seqs,
  feature = "transcript",
  exononly = TRUE,
  idfield = "transcript_id",
  attrsep = "; "
)
seq_gtf(
  gtf,
  seqs,
  feature = "transcript",
  exononly = TRUE,
  idfield = "transcript_id",
  attrsep = "; "
)

Arguments

`gtf`	one of path to GTF file, or data frame representing a canonical GTF file.
`seqs`	one of path to folder containing one FASTA file (`.fa` extension) for each chromosome in `gtf`, or named DNAStringSet containing one DNAString per chromosome in `gtf`, representing its sequence. In the latter case, `names(seqs)` should contain the same entries as the `seqnames` (first) column of `gtf`.
`feature`	one of `'transcript'` or `'exon'` (default transcript), depending on desired return.
`exononly`	if `TRUE` (as it is by default), only create transcript sequences from the features labeled `exon` in `gtf`.
`idfield`	in the `attributes` column of `gtf`, what is the name of the field identifying transcripts? Should be character. Default `"transcript_id"`.
`attrsep`	in the `attributes` column of `gtf`, how are attributes separated? Default `"; "`.

Value

If feature is 'transcript', DNAStringSet containing transcript sequences, with names corresponding to idfield in gtf. If feature is 'exon', DNAStringSet containing exon sequences from gtf, named by exon location (chr, start, end, strand).

References

http://www.ensembl.org/info/website/upload/gff.html

Examples

 ## Not run: 
  library(Biostrings)
  system('wget https://www.dropbox.com/s/04i6msi9vu2snif/chr22seq.rda')
  load('chr22seq.rda')
  data(gtf_dataframe)
  chr22_processed = seq_gtf(gtf_dataframe, chr22seq)

## End(Not run)
## Not run: 
  library(Biostrings)
  system('wget https://www.dropbox.com/s/04i6msi9vu2snif/chr22seq.rda')
  load('chr22seq.rda')
  data(gtf_dataframe)
  chr22_processed = seq_gtf(gtf_dataframe, chr22seq)

## End(Not run)

simulate RNA-seq experiment using negative binomial model

Description

create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups

Usage

simulate_experiment(
  fasta = NULL,
  gtf = NULL,
  seqpath = NULL,
  outdir = ".",
  num_reps = c(10, 10),
  reads_per_transcript = 300,
  size = NULL,
  fold_changes,
  paired = TRUE,
  reportCoverage = FALSE,
  ...
)
simulate_experiment(
  fasta = NULL,
  gtf = NULL,
  seqpath = NULL,
  outdir = ".",
  num_reps = c(10, 10),
  reads_per_transcript = 300,
  size = NULL,
  fold_changes,
  paired = TRUE,
  reportCoverage = FALSE,
  ...
)

Arguments

`fasta`	path to FASTA file containing transcripts from which to simulate reads. See details.
`gtf`	path to GTF file containing transcript structures from which reads should be simulated. See details.
`seqpath`	path to folder containing one FASTA file (`.fa` extension) for each chromosome in `gtf`. See details.
`outdir`	character, path to folder where simulated reads should be written, with no slash at the end. By default, reads are written to current working directory.
`num_reps`	How many biological replicates should be in each group? The length `num_reps` determines how many groups are in the experiment. For example, `num_reps = c(5,6,5)` specifies a 3-group experiment with 5 samples in group 1, 6 samples in group 2, and 5 samples in group 3. Defaults to a 2-group experiment with 10 reps per group (i.e., `c(10,10)`).
`reads_per_transcript`	baseline mean number of reads to simulate from each transcript. Can be an integer, in which case this many reads are simulated from each transcript, or an integer vector whose length matches the number of transcripts in `fasta`. Default 300. You can also leave `reads_per_transcript` empty and set `meanmodel=TRUE` to draw baseline mean numbers from a model based on transcript length.
`size`	the negative binomial `size` parameter (see `NegBinomial`) for the number of reads drawn per transcript. It can be a matrix (where the user can specify the size parameter per transcript, per group), a vector (where the user can specify the size per transcript, perhaps relating to reads_per_transcript), or a single number, specifying the size for all transcripts and groups. If left NULL, defaults to `reads_per_transcript * fold_changes / 3`. Negative binomial variance is mean + mean^2 / size.
`fold_changes`	Matrix specifying multiplicative fold changes between groups. There is no default, so you must provide this argument. In real data sets, lowly-expressed transcripts often show high fold changes between groups, so this can be kept in mind when setting `fold_changes` and `reads_per_transcript`. This argument must have the same number of columns as there are groups as specified by `num_reps`, and must have the same number of rows as there are transcripts in `fasta`. A fold change of X in matrix entry i,j means that for replicate j, the baseline mean number of reads (reads_per_transcript[i]) will be multiplied by X. Note that the multiplication happens before the negative binomial value (for the number of reads that actually will be drawn from transcript i, for replicate j) is drawn. This argument is ignored if `length(num_reps)` is 1 (meaning you only have 1 group in your simulation).
`paired`	If `TRUE`, paired-end reads are simulated; else single-end reads are simulated. Default `TRUE`
`reportCoverage`	whether to write out coverage information to `sample_coverages.rda` file in the `outdir`. defaults to `FALSE`
`...`	any of several other arguments that can be used to add nuance to the simulation. See details.

Details

Reads can either be simulated from a FASTA file of transcripts (provided with the fasta argument) or from a GTF file plus DNA sequences (provided with the gtf and seqpath arguments). Simulating from a GTF file and DNA sequences may be a bit slower: it took about 6 minutes to parse the GTF/sequence files for chromosomes 1-22, X, and Y in hg19.

Several optional parameters can be passed to this function to adjust the simulation. The options are:

readlen: read length. Default 100.
lib_sizes: Library size factors for the biological replicates. lib_sizes should have length equal to the total number of replicates in the experiment, i.e., sum(num_reps). For each replicate, once the number of reads to simulate from each transcript for that replicate is known, all read numbers across all transcripts from that replicate are multiplied by the corresponding entry in lib_sizes.
distr One of 'normal', 'empirical', or 'custom', which specifies the distribution from which to draw RNA fragment lengths. If 'normal', draw fragment lengths from a normal distribution. You can provide the mean of that normal distribution with fraglen (defaults to 250) and the standard deviation of that normal distribution with fragsd (defaults to 25). You can provide a single number for each, or a vector with length equal to the total number of samples. If 'empirical', draw fragment lengths from a fragment length distribution estimated from a real data set. If 'custom', draw fragment lengths from a custom distribution, which you can provide as the custdens argument. custdens should be a density fitted using logspline.
error_model: The error model can be one of:
- 'uniform': errors are distributed uniformly across reads. You can also provide an 'error_rate' parameter, giving the overall probability of making a sequencing error at any given nucleotide. This error rate defaults to 0.005.
- 'illumina4' or 'illumina5': Empirical error models. See ?add_platform_error for more information.
- 'custom': A custom error model you've estimated from an RNA-seq data set using GemErr. See ?add_platform_error for more info. You will need to provide both model_path and model_prefix if using a custom error model. model_path is the output folder you provided to build_error_model.py. This path should contain either two files suffixed _mate1 and _mate2, or a file suffixed _single. model_prefix is the 'prefix' argument you provided to build_error_model.py and is whatever comes before the _mate1/_mate2 or _single files in model_path.
bias One of 'none', 'rnaf', or 'cdnaf'. 'none' represents uniform fragment selection (every possible fragment in a transcript has equal probability of being in the experiment); 'rnaf' represents positional bias that arises in protocols using RNA fragmentation, and 'cdnaf' represents positional bias arising in protocols that use cDNA fragmentation (Li and Jiang 2012). Using the 'rnaf' model, coverage is higher in the middle of the transcript and lower at both ends, and in the 'cdnaf' model, coverage increases toward the 3' end of the transcript. The probability models used come from Supplementary Figure S3 of Li and Jiang (2012). Defaults to 'none' if you don't provide this.
gcbias list indicating which samples to add GC bias to, and from which models. Should be the same length as sum(num_reps); entries can be either numeric or of class loess. A numeric entry of 0 indicates no GC bias. Numeric entries 1 through 7 correspond to the 7 empirical GC models that ship with Polyester, estimated from GEUVADIS HapMap samples NA06985, NA12144, NA12776, NA18858, NA20542, NA20772, and NA20815, respectively. The code used to derive the empirical GC models is available at https://github.com/alyssafrazee/polyester/blob/master/make_gc_bias.R. A loess entry should be a loess prediction model that takes a GC content percent value (between 0 and 1) a transcript's deviation from overall mean read count based on that GC value. Counts for each replicate will be adjusted based on the GC bias model specified for it. Numeric and loess entries can be mixed. By default, no bias is included.
frag_GC_bias Either a matrix of dimensions 101 x sum(num_reps) or 'none'. The default is 'none'. If specified, the matrix contains the probabilities (a number in the range [0,1]) that a fragment will appear in the output given its GC content. The first row corresponds to a fragment with GC content of 0 percent, the second row 1 percent, the third row 2 percent, etc., and the last row 100 percent. The columns correspond to different probabilites for each sample. Internally, a coin flip (a Bernoulli trial) determines if each fragment is kept, depending on its GC content. Note that the final library size will depend on the elements of the matrix, and it might make sense to scale up the lib_size of the samples with low probabilites in the matrix in the range of the transcriptome GC content distribution. Note that the count_matrix written to outdir contains the counts before applying fragment GC bias.
strand_specific defaults to FALSE, which means fragments are generated with equal probability from both strands of the transcript sequence. set to TRUE for strand-specific simulation (1st read forward strand, 2nd read reverse strand with respect to transcript sequence).
meanmodel: set to TRUE if you'd like to set reads_per_transcripts as a function of transcript length. We fit a linear model regressing transcript abundance on transcript length, and setting meanmodel=TRUE means we will use transcript lengths to draw transcript abundance based on that linear model. You can see our modeling code at http://htmlpreview.github.io/?https://github.com/alyssafrazee/polyester_code/blob/master/length_simulation.html
write_info: set to FALSE if you do not want files of simulation information written to disk. By default, transcript fold changes and expression status, replicate library sizes and group identifiers, and an R data object of the counts matrix (before application of fragment GC bias) are written to outdir.
seed: specify a seed (e.g. seed=142 or some other integer) to set before randomly drawing read numbers, for reproducibility.
transcriptid: optional vector of transcript IDs to be written into sim_info.txt and used as transcript identifiers in the output fasta files. Defaults to names(readDNAStringSet(fasta)). This option is useful if default names are very long or contain special characters.
gzip: pass gzip=TRUE to write gzipped fasta files as output (by default, fasta output files are not compressed when written to disk).
exononly: (passed to seq_gtf) if TRUE (as it is by default), only create transcript sequences from the features labeled exon in gtf.
idfield: (passed to seq_gtf)in the attributes column of gtf, what is the name of the field identifying transcripts? Should be character. Default "transcript_id".
attrsep: (passed to seq_gtf) in the attributes column of gtf, how are attributes separated? Default "; ".

Value

No return, but simulated reads and a simulation info file are written to outdir. Note that reads are written out transcript by transcript and so need to be shuffled if used as input to quantification algorithms such as eXpress or Salmon.

References

't Hoen PA, et al (2013): Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nature Biotechnology 31(11): 1015-1022.

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

McElroy KE, Luciani F and Thomas T (2012): GemSIM: general, error-model based simulator of next-generation sequencing data. BMC Genomics 13(1), 74.

Examples


  ## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  set.seed(4)
  fold_change_values = sample(c(0.5, 1, 2), size=2*numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  fold_changes = matrix(fold_change_values, nrow=numtx)
  library(Biostrings)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)


## simulate a few reads from chromosome 22

  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  set.seed(4)
  fold_change_values = sample(c(0.5, 1, 2), size=2*numtx,
     prob=c(0.05, 0.9, 0.05), replace=TRUE)
  fold_changes = matrix(fold_change_values, nrow=numtx)
  library(Biostrings)
  # remove quotes from transcript IDs:
  tNames = gsub("'", "", names(readDNAStringSet(fastapath)))

  simulate_experiment(fastapath, reads_per_transcript=10,
     fold_changes=fold_changes, outdir='simulated_reads',
     transcriptid=tNames, seed=12)

Simulate RNA-seq experiment

Description

create FASTA files containing RNA-seq reads simulated from provided transcripts, with optional differential expression between two groups (designated via read count matrix)

Usage

simulate_experiment_countmat(
  fasta = NULL,
  gtf = NULL,
  seqpath = NULL,
  readmat,
  outdir = ".",
  paired = TRUE,
  seed = NULL,
  ...
)
simulate_experiment_countmat(
  fasta = NULL,
  gtf = NULL,
  seqpath = NULL,
  readmat,
  outdir = ".",
  paired = TRUE,
  seed = NULL,
  ...
)

Arguments

`fasta`	path to FASTA file containing transcripts from which to simulate reads. See details.
`gtf`	path to GTF file or data frame containing transcript structures from which reads should be simulated. See details and `seq_gtf`.
`seqpath`	path to folder containing one FASTA file (`.fa` extension) or DNAStringSet containing one entry for each chromosome in `gtf`. See details and `seq_gtf`.
`readmat`	matrix with rows representing transcripts and columns representing samples. Entry i,j specifies how many reads to simulate from transcript i for sample j.
`outdir`	character, path to folder where simulated reads should be written, without a slash at the end of the folder name. By default, reads written to the working directory.
`paired`	If `TRUE`, paired-end reads are simulated; else single-end reads are simulated.
`seed`	Optional seed to set before simulating reads, for reproducibility.
`...`	Additional arguments to add nuance to the simulation, as described extensively in the details of `simulate_experiment`, or to pass to `seq_gtf`, if `gtf` is not `NULL`.

Details

Value

No return, but simulated reads are written to outdir.

References

Li W and Jiang T (2012): Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics 28(22): 2914-2921.

Examples


  fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  readmat = matrix(20, ncol=10, nrow=numtx)
  readmat[1:30, 1:5] = 40

  simulate_experiment_countmat(fasta=fastapath, 
    readmat=readmat, outdir='simulated_reads_2', seed=5)

fastapath = system.file("extdata", "chr22.fa", package="polyester")
  numtx = count_transcripts(fastapath)
  readmat = matrix(20, ncol=10, nrow=numtx)
  readmat[1:30, 1:5] = 40

  simulate_experiment_countmat(fasta=fastapath, 
    readmat=readmat, outdir='simulated_reads_2', seed=5)

Simulate RNA-seq experiment based on abundances from a data set

Description

Create fasta files representing reads from a two-group experiment, where abundances and differential expression are estimated from a real data set

Usage

simulate_experiment_empirical(
  bg = NULL,
  fpkmMat = NULL,
  mean_rps = 5e+06,
  grouplabels = NULL,
  decut = 1.5,
  outdir = ".",
  ...
)
simulate_experiment_empirical(
  bg = NULL,
  fpkmMat = NULL,
  mean_rps = 5e+06,
  grouplabels = NULL,
  decut = 1.5,
  outdir = ".",
  ...
)

Arguments

`bg`	Ballgown object containing estimated transcript abundances in FPKM. Reads will be simulated for the same number of replicates that are in `bg`. Must provide exactly one of `bg` and `fpkmMat`.
`fpkmMat`	transcript-by-sample matrix containing abundances (in FPKM) estimated from a real data set. MUST have row names identifying transcripts. The number of columns is the number of samples that will be simulated.
`mean_rps`	Number of reads per sample to use in converting FPKM measurements to counts. Should be somewhat close to the number of reads per sample in the experiment that generated the estimated FPKMs. Defaults to 5 million (5e6).
`grouplabels`	vector indicating the group labels for each replicate in the experiment. Must be convertible to a factor with exactly 2 levels.
`decut`	A transcript will be recorded as truly differentially expressed if its fold change between the two groups is more extreme than `decut`, in either direction.
`outdir`	character, path to folder where simulated reads should be written, without a slash at the end of the folder name. By default, reads written to the working directory.
`...`	Additional arguments to pass to simulate_experiment_countmat

Value

No return, but reads are written to outdir.

Examples

## Not run: 

  library(ballgown) 
  data(bg)
  bg = subset(bg, "chr=='22'")
  
  # load gtf file:
  gtfpath = system.file('extdata', 'bg.gtf.gz', package='polyester')
  gtf = subset(gffRead(gtfpath), seqname=='22')
 
  # load/download chromosome sequence (just for this example)
  system('wget https://www.dropbox.com/s/04i6msi9vu2snif/chr22seq.rda')
  load('chr22seq.rda')
  names(chr22seq) = '22'

  # simulate reads based on this experiment's FPKMs
  simulate_experiment_empirical(bg, grouplabels=pData(bg)$group, gtf=gtf, 
     seqpath=chr22seq, mean_rps=5000, outdir='simulated_reads_3', seed=1247)

## End(Not run)
## Not run: 

  library(ballgown) 
  data(bg)
  bg = subset(bg, "chr=='22'")
  
  # load gtf file:
  gtfpath = system.file('extdata', 'bg.gtf.gz', package='polyester')
  gtf = subset(gffRead(gtfpath), seqname=='22')
 
  # load/download chromosome sequence (just for this example)
  system('wget https://www.dropbox.com/s/04i6msi9vu2snif/chr22seq.rda')
  load('chr22seq.rda')
  names(chr22seq) = '22'

  # simulate reads based on this experiment's FPKMs
  simulate_experiment_empirical(bg, grouplabels=pData(bg)$group, gtf=gtf, 
     seqpath=chr22seq, mean_rps=5000, outdir='simulated_reads_3', seed=1247)

## End(Not run)

write sequencing reads to disk

Description

given a DNAStringSet representing simulated sequencing reads, write FASTA files to disk representing the simulated reads.

Usage

write_reads(reads, fname, readlen, paired = TRUE, gzip, offset = 1L)
write_reads(reads, fname, readlen, paired = TRUE, gzip, offset = 1L)

Arguments

`reads`	DNAStringSet representing sequencing reads
`fname`	file path/prefix specifying where sequencing reads should be written. Should not contain ".fasta" (this is appended automatically).
`readlen`	maximum length of the reads in `reads`.
`paired`	If `TRUE`, reads are assumed to be in pairs: i.e., read 1 and read 2 in `reads` are the left and right mate (respectively) of a read pair; same with read 3 and read 4, etc. The odd-numbered reads are written to `fname_1.fasta` and the even-numbered reads are written to `fname_2.fasta`. If `FALSE`, reads are assumed to be single-end and just one file, `fname.fasta`, is written.
`gzip`	If `TRUE`, gzip the output fasta files.
`offset`	An integer number greater or equal to 1 to start assigning read numbers at.

Details

The get_reads function returns a DNAStringSet object representing sequencing reads that can be directly passed to write_reads. If output other than that from get_reads is used and paired is TRUE, make sure reads is ordered properly (i.e., that mate pairs appear together and that the left mate appears first).

Value

No return, but FASTA file(s) containing the sequences in reads are written to fname.fasta (if paired is FALSE) or fname_1.fasta and fname_2.fasta if paired is TRUE.

Examples

  library(Biostrings)
  data(srPhiX174) # pretend srPhiX174 represents a DNAStringSet of *reads*
  readlen = unique(width(srPhiX174)) #35
  write_reads(srPhiX174, fname='./srPhiX174', readlen=readlen, paired=FALSE,
      gzip=FALSE)

## If the file is too big, you can subset it and write it in chunks.
## Here we split our 'reads' into two chunks and save them to the same file.
write_reads(srPhiX174[1:100], fname='./srPhiX174-offset', readlen=readlen,
   paired=FALSE, gzip=FALSE, offset = 1L)
write_reads(srPhiX174[101:length(srPhiX174)], fname='./srPhiX174-offset',
   readlen=readlen, paired=FALSE, gzip=FALSE, offset = 101L)

## We can verify that we get the same results
srPhi <- readDNAStringSet('./srPhiX174.fasta')
srPhiOffset <- readDNAStringSet('./srPhiX174-offset.fasta')
identical(srPhi, srPhiOffset)

library(Biostrings)
  data(srPhiX174) # pretend srPhiX174 represents a DNAStringSet of *reads*
  readlen = unique(width(srPhiX174)) #35
  write_reads(srPhiX174, fname='./srPhiX174', readlen=readlen, paired=FALSE,
      gzip=FALSE)

## If the file is too big, you can subset it and write it in chunks.
## Here we split our 'reads' into two chunks and save them to the same file.
write_reads(srPhiX174[1:100], fname='./srPhiX174-offset', readlen=readlen,
   paired=FALSE, gzip=FALSE, offset = 1L)
write_reads(srPhiX174[101:length(srPhiX174)], fname='./srPhiX174-offset',
   readlen=readlen, paired=FALSE, gzip=FALSE, offset = 101L)

## We can verify that we get the same results
srPhi <- readDNAStringSet('./srPhiX174.fasta')
srPhiOffset <- readDNAStringSet('./srPhiX174-offset.fasta')
identical(srPhi, srPhiOffset)

Package 'polyester'

Help Index

add sequencing error to simulated reads

Description

Usage

Arguments

Value

Examples

add GC bias to a count matrix

Description

Usage

Arguments

Details

Value

Examples

Simulate sequencing error using empirical error model

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Model of positional bias that can arise when RNA-seq is performed using protocols relying on cDNA fragmentation.

Description

Format

References

determine how many transcripts are annotated in a FASTA or GTF file

Description

Usage

Arguments

Value

Examples

Generate a simulated data set based on known model parameters

Description

Usage

Arguments

Value

Author(s)

Examples

Estimated distribution of fragment lengths

Description

Format

References

Turn FPKMs from a ballgown object into estimated counts for transcripts

Description

Usage

Arguments

Details

Value

Author(s)

Examples

generate a set of fragments from a set of transcripts

Description

Usage

Arguments

Details

Value

References

See Also

Examples

Estimate zero-inflated negative binomial parameters from a real dataset

Description

Usage

Arguments

Value

Author(s)

Examples

get sequencing reads from fragments

Description

Usage

Arguments

Value

See Also

Examples

extract a specific field of the "attributes" column of a data frame created from a GTF/GFF file

Description

Usage

Arguments