Title: | Tools for Adaptive Immune Receptor Sequence-Based Keras Modeling |
---|---|
Description: | A set of tools to build tensorflow/keras3-based models in R from amino acid and nucleotide sequences focusing on adaptive immune receptors. The package includes pre-processing of sequences, unifying gene nomenclature usage, encoding sequences, and combining models. This package will serve as the basis of future immune receptor sequence functions/packages/models compatible with the scRepertoire ecosystem. |
Authors: | Nick Borcherding [aut, cre] |
Maintainer: | Nick Borcherding <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2024-10-31 03:48:44 UTC |
Source: | https://github.com/bioc/immApex |
Calculate frequency of adjacency between residues along a set of biological sequences.
adjacencyMatrix( input.sequences = NULL, normalize = TRUE, sequence.dictionary = amino.acids )
adjacencyMatrix( input.sequences = NULL, normalize = TRUE, sequence.dictionary = amino.acids )
input.sequences |
The amino acid or nucleotide sequences to use |
normalize |
Return the values as a function of total number of residues (TRUE) or frequencies (FALSE) |
sequence.dictionary |
The letters to use in sequence generation (default are all amino acids) |
Adjacency matrix based on input.sequences.
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) adj.matrix <- adjacencyMatrix(new.sequences, normalize = TRUE)
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) adj.matrix <- adjacencyMatrix(new.sequences, normalize = TRUE)
This function will format the genes into a clean nomenclature using the IMGT conventions.
formatGenes( input.data, region = "v", technology = NULL, species = "human", simplify.format = TRUE )
formatGenes( input.data, region = "v", technology = NULL, species = "human", simplify.format = TRUE )
input.data |
Data frame of sequencing data or scRepertoire outputs |
region |
Sequence gene loci to access - "v", "d", "j", or "c" or a combination using c("v", "d", "j") |
technology |
The sequencing technology employed - 'TenX', "Adaptive', or 'AIRR' |
species |
One or two word designation of species. Currently supporting: "human", "mouse", "rat", "rabbit", "rhesus monkey", "sheep", "pig", "platypus", "alpaca", "dog", "chicken", and "ferret" |
simplify.format |
If applicable, remove the allelic designation (TRUE) or retain all information (FALSE) |
A data frame with the new columns of formatted genes added.
data(immapex_example.data) formatGenes(immapex_example.data[["TenX"]], region = "v", technology = "TenX")
data(immapex_example.data) formatGenes(immapex_example.data[["TenX"]], region = "v", technology = "TenX")
Use this to make synthetic amino acid sequences for purposes of testing code, training models, or noise.
generateSequences( prefix.motif = NULL, suffix.motif = NULL, number.of.sequences = 100, min.length = 1, max.length = 10, sequence.dictionary = amino.acids )
generateSequences( prefix.motif = NULL, suffix.motif = NULL, number.of.sequences = 100, min.length = 1, max.length = 10, sequence.dictionary = amino.acids )
prefix.motif |
Add defined amino acid/nucleotide sequence to the start of the generated sequences. |
suffix.motif |
Add defined amino acid/nucleotide sequence to the end of the generated sequences |
number.of.sequences |
Number of sequences to generate |
min.length |
Minimum length of the final sequence. The min.length may be adjusted if incongruent with prefix.motif/suffix.motif lengths |
max.length |
Maximum length of the final sequence |
sequence.dictionary |
The letters to use in sequence generation (default are all amino acids) |
A vector of generated sequences
generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16)
generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16)
Use this to transform amino acid sequences into a geometric encoding of the sequence.
geometricEncoder( input.sequences, method.to.use = "BLOSUM62", theta = pi/3, verbose = TRUE )
geometricEncoder( input.sequences, method.to.use = "BLOSUM62", theta = pi/3, verbose = TRUE )
input.sequences |
The set of amino acid sequences |
method.to.use |
The method or approach for the conversion:
|
theta |
The angle for geometric transformation |
verbose |
Print messages corresponding to the processing step |
Geometric encoded amino acid sequences in a matrix
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- geometricEncoder(new.sequences, method.to.use = "BLOSUM62", theta = pi/3)
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- geometricEncoder(new.sequences, method.to.use = "BLOSUM62", theta = pi/3)
Use this to access the ImMunoGeneTics (IMGT) sequences for a specific species and gene loci. More information on IMGT can be found at imgt.org.
getIMGT( species = "human", chain = "TRB", sequence.type = "aa", frame = "inframe", region = "v", max.retries = 3, verbose = TRUE )
getIMGT( species = "human", chain = "TRB", sequence.type = "aa", frame = "inframe", region = "v", max.retries = 3, verbose = TRUE )
species |
One or two-word common designation of species. |
chain |
Sequence chain to access, e.g., TRB or IGH. |
sequence.type |
Type of sequence - aa (amino acid) or nt (nucleotide). |
frame |
Designation for all, inframe, or inframe+gap. |
region |
Gene loci to access. |
max.retries |
Number of attempts to fetch data in case of failure. |
verbose |
Print messages corresponding to the processing step. |
A list of allele sequences.
TRBV_aa <- getIMGT(species = "human", chain = "TRB", frame = "inframe", region = "v", sequence.type = "aa", max.retries = 3)
TRBV_aa <- getIMGT(species = "human", chain = "TRB", frame = "inframe", region = "v", sequence.type = "aa", max.retries = 3)
Use this to extract immune receptor sequences from a Single-Cell Object or the output of combineTCR and combineBCR.
getIR(input.data, chains, sequence.type = "aa")
getIR(input.data, chains, sequence.type = "aa")
input.data |
Single-cell object or the output of combineTCR and combineBCR from scRepertoire |
chains |
Immune Receptor chain to use - TRA, TRB, IGH, or IGL |
sequence.type |
Extract amino acid (aa) or nucleotide (nt) sequences |
A data frame of nucleotide or amino acid sequences
A list of amino acid properties that are
used for propertyEncoder
function.
This includes:
atchleyFactors
crucianiProperties
FASGAI
kideraFactors
MSWHIM
ProtFP
stScales
tScales
VHSE
zScales
data("immapex_AA.data")
data("immapex_AA.data")
List of 10 amino acid properties for 20 amino acids
A list of amino acid substitution matrices, using the Point Accepted Matrix (PAM) and BLOck SUbstitution Matrix (BLOSUM) approaches. A discussion and comparison of these matrices are available at PMID: 21356840.
BLOSUM45
BLOSUM50
BLOSUM62
BLOSUM80
BLOSUM100
PAM30
PAM40
PAM70
PAM120
PAM250
data("immapex_blosum.pam.matrices")
data("immapex_blosum.pam.matrices")
List of 10 substitution matrices
Contains a collection of bulk or paired TCR sequences in the respective formats in the form of a list from the following sources:
TenX: 10k_Human_DTC_Melanoma_5p_nextgem_Multiplex from 10x Website.
AIRR: Human_colon_16S8157851 from PMID: 37055623.
Adaptive: Adaptive_2283_D0 from PMID: 36220826.
More information on the data formats are available: AIRR, Adaptive, and TenX.
data("immapex_example.data")
data("immapex_example.data")
List of 3 example data sets for 10x, AIRR and Adaptive contigs.
A list of regularized gene nomenclature to use for converting for data for uniformity. Data is organize by gene region, loci and species. Not all species are represented in the data and pseudogenes have not been removed.
data("immapex_gene.list")
data("immapex_gene.list")
List of gene nomenclature by region, loci, and species.
Use this isolate sequences from the CDR loop using the V gene annotation. When there are multiple V gene matches for a single gene, the first allelic sequence is used.
inferCDR( input.data, reference = NULL, chain = "TRB", technology = NULL, sequence.type = "aa", sequences = c("CDR1", "CDR2") )
inferCDR( input.data, reference = NULL, chain = "TRB", technology = NULL, sequence.type = "aa", sequences = c("CDR1", "CDR2") )
input.data |
Data frame output of |
reference |
IMGT reference sequences from |
chain |
Sequence chain to access, like TRB or IGH |
technology |
The sequencing technology employed - TenX, Adaptive, AIRR, or Omniscope |
sequence.type |
Type of sequence - aa for amino acid or nt for nucleotide |
sequences |
The specific regions of the CDR loop to get from the data, such as CDR1. |
A data frame with the new columns of CDR sequences added.
# Getting the Sequence Reference data(immapex_example.data) TRBV_aa <- getIMGT(species = "human", chain = "TRB", frame = "inframe", region = "v", sequence.type = "aa") # Ensuring sequences are formatted to IMGT TenX_formatted <- formatGenes(immapex_example.data[["TenX"]], region = "v", technology = "TenX") # Inferring CDR loop elements TenX_formatted <- inferCDR(TenX_formatted, chain = "TRB", reference = TRBV_aa, technology = "TenX", sequence.type = "aa", sequences = c("CDR1", "CDR2"))
# Getting the Sequence Reference data(immapex_example.data) TRBV_aa <- getIMGT(species = "human", chain = "TRB", frame = "inframe", region = "v", sequence.type = "aa") # Ensuring sequences are formatted to IMGT TenX_formatted <- formatGenes(immapex_example.data[["TenX"]], region = "v", technology = "TenX") # Inferring CDR loop elements TenX_formatted <- inferCDR(TenX_formatted, chain = "TRB", reference = TRBV_aa, technology = "TenX", sequence.type = "aa", sequences = c("CDR1", "CDR2"))
Use this to mutate or mask sequences for purposes of testing code, training models, or noise.
mutateSequences( input.sequences, n.sequences = 1, mutation.rate = 0.01, position.start = NULL, position.end = NULL, sequence.dictionary = amino.acids )
mutateSequences( input.sequences, n.sequences = 1, mutation.rate = 0.01, position.start = NULL, position.end = NULL, sequence.dictionary = amino.acids )
input.sequences |
The amino acid or nucleotide sequences to use |
n.sequences |
The number of mutated sequences to return |
mutation.rate |
The rate of mutations to introduce into sequences |
position.start |
The starting position to mutate along the sequence Default = NULL will start the random mutations at position 1 |
position.end |
The ending position to mutate along the sequence Default = NULL will end the random mutations at the last position |
sequence.dictionary |
The letters to use in sequence mutation (default are all amino acids) |
A vector of mutated sequences
sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) mutated_sequences <- mutateSequences(sequences, n.sequence = 1, position.start = 3, position.end = 8)
sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) mutated_sequences <- mutateSequences(sequences, n.sequence = 1, position.start = 3, position.end = 8)
Use this to transform amino acid or nucleotide sequences into a one hot encoding of the sequence.
onehotEncoder( input.sequences, max.length = NULL, motif.length = 1, convert.to.matrix = TRUE, sequence.dictionary = amino.acids, padding.symbol = ".", verbose = TRUE )
onehotEncoder( input.sequences, max.length = NULL, motif.length = 1, convert.to.matrix = TRUE, sequence.dictionary = amino.acids, padding.symbol = ".", verbose = TRUE )
input.sequences |
The amino acid or nucleotide sequences to use |
max.length |
Additional length to pad, NULL will pad sequences to the max length of input.sequences |
motif.length |
The length of the amino acid residues to encode - a motif.length = 1 produces single amino acid encodings |
convert.to.matrix |
Return a matrix (TRUE) or a 3D array (FALSE) |
sequence.dictionary |
The letters to use in sequence generation (default are all amino acids). This will be overrode if using a motif approach (motif.length > 1) |
padding.symbol |
Symbol to use for padding at the end of sequences |
verbose |
Print messages corresponding to the processing step |
One hot encoded sequences in a matrix or 3D array
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- onehotEncoder(new.sequences, convert.to.matrix = TRUE)
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- onehotEncoder(new.sequences, convert.to.matrix = TRUE)
Use this calculate positional encoding for recurrent neural networks using sin/cos and position information.
positionalEncoder(number.of.sequences, latent.dims = NULL)
positionalEncoder(number.of.sequences, latent.dims = NULL)
number.of.sequences |
The number of sequences to generate position information |
latent.dims |
The number of latent dimensions. |
A matrix of values
position.info <- positionalEncoder(number.of.sequences = 1000, latent.dims = 64)
position.info <- positionalEncoder(number.of.sequences = 1000, latent.dims = 64)
Use this to generate a position-probability or weight matrix for a set of given sequences.
probabilityMatrix( input.sequences, max.length = NULL, convert.PWM = FALSE, background.frequencies = NULL, sequence.dictionary = amino.acids, padding.symbol = ".", verbose = TRUE )
probabilityMatrix( input.sequences, max.length = NULL, convert.PWM = FALSE, background.frequencies = NULL, sequence.dictionary = amino.acids, padding.symbol = ".", verbose = TRUE )
input.sequences |
The amino acid or nucleotide sequences to use |
max.length |
Additional length to pad, NULL will pad sequences to the max length of input.sequences |
convert.PWM |
Convert the matrix into a positional weight matrix using log likelihood |
background.frequencies |
Provide amino acid or nucleotide frequencies for the positional weight matrix. If NULL, assumes uniform likelihood. |
sequence.dictionary |
The letters to use in sequence generation (default are all amino acids) |
padding.symbol |
Symbol to use for padding at the end of sequences |
verbose |
Print messages corresponding to the processing step |
A matrix with position specific probabilities or weights
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) PPM.matrix <- probabilityMatrix(new.sequences)
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) PPM.matrix <- probabilityMatrix(new.sequences)
Use this to transform amino acid sequences a a matrix by amino acid properties derived from dimensional reduction strategies
propertyEncoder( input.sequences, max.length = NULL, method.to.use = NULL, convert.to.matrix = TRUE, summary.function = NULL, padding.symbol = ".", verbose = TRUE )
propertyEncoder( input.sequences, max.length = NULL, method.to.use = NULL, convert.to.matrix = TRUE, summary.function = NULL, padding.symbol = ".", verbose = TRUE )
input.sequences |
The amino acid sequences to use |
max.length |
Additional length to pad, NULL will pad sequences to the max length of input.sequences |
method.to.use |
The method or approach to use for the conversion:
|
convert.to.matrix |
Return a matrix (TRUE) or a 3D array (FALSE) |
summary.function |
Return a matrix that summarize the amino acid method/property Available summaries include: "median", "mean", "sum", variance ("vars"), or Median Absolute Deviation ("mads") |
padding.symbol |
Symbol to use for padding at the end of sequences |
verbose |
Print messages corresponding to the processing step |
Converted amino acid sequences by property in a matrix or 3D array
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- propertyEncoder(new.sequences, method.to.use = "VHSE", convert.to.matrix = TRUE)
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- propertyEncoder(new.sequences, method.to.use = "VHSE", convert.to.matrix = TRUE)
Use this to transform one hot encoded sequences back into amino acid or nucleotide sequences.
sequenceDecoder( sequence.matrix, encoder = "onehotEncoder", aa.method.to.use = NULL, call.threshold = 0.5, sequence.dictionary = amino.acids, padding.symbol = ".", remove.padding = TRUE )
sequenceDecoder( sequence.matrix, encoder = "onehotEncoder", aa.method.to.use = NULL, call.threshold = 0.5, sequence.dictionary = amino.acids, padding.symbol = ".", remove.padding = TRUE )
sequence.matrix |
The encoded sequences to decode in an array or matrix |
encoder |
The method to prepare the sequencing information - "onehotEncoder" or "propertyEncoder" |
aa.method.to.use |
The method or approach to use for the conversion:
|
call.threshold |
The relative strictness of sequence calling with higher values being more stringent |
sequence.dictionary |
The letters to use in sequence generation (default are all amino acids) |
padding.symbol |
Symbol to use for padding at the end of sequences |
remove.padding |
Remove the additional symbol from the end of decoded sequences |
Decoded amino acid or nucleotide sequences
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- onehotEncoder(new.sequences, convert.to.matrix = TRUE) decoded.sequences <- sequenceDecoder(sequence.matrix, padding.symbol = ".")
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- onehotEncoder(new.sequences, convert.to.matrix = TRUE) decoded.sequences <- sequenceDecoder(sequence.matrix, padding.symbol = ".")
Use this to transform amino acid sequences into tokens in preparing for deep learning models.
tokenizeSequences( input.sequences, add.startstop = TRUE, start.token = "!", stop.token = "^", max.length = NULL, convert.to.matrix = TRUE, verbose = TRUE )
tokenizeSequences( input.sequences, add.startstop = TRUE, start.token = "!", stop.token = "^", max.length = NULL, convert.to.matrix = TRUE, verbose = TRUE )
input.sequences |
The amino acid or nucleotide sequences to use |
add.startstop |
Add start and stop tokens to the sequence |
start.token |
The character to use for the start token |
stop.token |
The character to use for the stop token |
max.length |
Additional length to pad, NULL will pad sequences to the max length of input.sequences |
convert.to.matrix |
Return a matrix (TRUE) or a vector (FALSE) |
verbose |
Print messages corresponding to the processing step |
Tokenize sequences in a matrix or vector
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- tokenizeSequences(new.sequences, add.startstop = TRUE, start.token = "!", stop.token = "^", convert.to.matrix = TRUE)
new.sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) sequence.matrix <- tokenizeSequences(new.sequences, add.startstop = TRUE, start.token = "!", stop.token = "^", convert.to.matrix = TRUE)
Use this to simulate sequences using a variational autoencoder (VAE) and perturbation of the probability distributions.
variationalSequences( input.sequences, encoder.function = "onehotEncoder", aa.method.to.use = NULL, number.of.sequences = 100, encoder.hidden.dim = c(128, 64), decoder.hidden.dim = NULL, latent.dim = 16, batch.size = 16, epochs = 50, learning.rate = 0.001, epsilon.std = 1, call.threshold = 0.2, activation.function = "relu", optimizer = "adam", disable.eager.execution = FALSE, sequence.dictionary = amino.acids, verbose = TRUE )
variationalSequences( input.sequences, encoder.function = "onehotEncoder", aa.method.to.use = NULL, number.of.sequences = 100, encoder.hidden.dim = c(128, 64), decoder.hidden.dim = NULL, latent.dim = 16, batch.size = 16, epochs = 50, learning.rate = 0.001, epsilon.std = 1, call.threshold = 0.2, activation.function = "relu", optimizer = "adam", disable.eager.execution = FALSE, sequence.dictionary = amino.acids, verbose = TRUE )
input.sequences |
The amino acid or nucleotide sequences to use |
encoder.function |
The method to prepare the sequencing information - "onehotEncoder" or "propertyEncoder" |
aa.method.to.use |
The method or approach to use for the conversion:
|
number.of.sequences |
Number of sequences to generate |
A vector of the neurons to use in the hidden layers for the encoder portion of the model |
|
A vector of the neurons to use in the hidden layers for the decoder portion of the model. If NULL assumes symmetric autoencoder |
|
latent.dim |
The size of the latent dimensions |
batch.size |
The batch size to use for VAE training |
epochs |
The number of epochs to use in VAE training |
learning.rate |
The learning rate to use in VAE training |
epsilon.std |
The epsilon to use in VAE training |
call.threshold |
The relative strictness of sequence calling with higher values being more stringent |
activation.function |
The activation for the dense connected layers |
optimizer |
The optimizer to use in VAE training |
disable.eager.execution |
Disable the eager execution parameter for tensorflow. |
sequence.dictionary |
The letters to use in sequence mutation (default are all amino acids) |
verbose |
Print messages corresponding to the processing step |
A vector of mutated sequences
sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) new.sequences <- variationalSequences(sequences, encoder = "onehotEncoder", encoder.hidden.dim = c(256, 128), latent.dim = 16, batch.size = 16)
sequences <- generateSequences(prefix.motif = "CAS", suffix.motif = "YF", number.of.sequences = 100, min.length = 8, max.length = 16) new.sequences <- variationalSequences(sequences, encoder = "onehotEncoder", encoder.hidden.dim = c(256, 128), latent.dim = 16, batch.size = 16)