Title: | High Throughput Sequencing Cell Authentication Toolkit |
---|---|
Description: | The seqCAT package uses variant calling data (in the form of VCF files) from high throughput sequencing technologies to authenticate and validate the source, function and characteristics of biological samples used in scientific endeavours. |
Authors: | Erik Fasterius [aut, cre] |
Maintainer: | Erik Fasterius <[email protected]> |
License: | MIT + file LICENCE |
Version: | 1.29.0 |
Built: | 2024-10-31 05:23:26 UTC |
Source: | https://github.com/bioc/seqCAT |
Calculate the similarity statistics for SNV profile comparisons.
calculate_similarity(data, similarity = NULL, a = 1, b = 5)
data |
The input SNV data dataframe. |
similarity |
Optional dataframe to add results to. |
a |
Similarity score parameter a (integer). |
b |
Similarity score parameter b (integer). |
This function calculates various summary statistics and sample similarities for a given profile comparison dataframe. It returns a small dataframe with the overall similarity score (whose parameters 'a' and 'b' can be adjusted in the function call), total SNV data, the concordance of the data and the sample names in question. This dataframe can also be given to the function, in which case it will simply add another row for the current samples, facilitating downstream aggregate analyses.
A dataframe with summary statistics.
# Load test data
data(test_comparison)

# Calculate similarities
similarity <- calculate_similarity(test_comparison)

# Add another row of summary statistics
calculate_similarity(test_comparison, similarity = similarity)
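To give a feel for how the parameters 'a' and 'b' might influence the score, the following base R sketch assumes that the score has the form 100 * (matches + a) / (overlaps + a + b) and that the concordance is the plain percentage of matching genotypes. Both the formula and the helper name `similarity_sketch` are assumptions made for illustration and are not taken from this manual; the package source remains the reference for the exact definition.

```r
# Hypothetical helper, NOT part of seqCAT. Assumes the score has the form
# 100 * (matches + a) / (overlaps + a + b).
similarity_sketch <- function(n_matches, n_overlaps, a = 1, b = 5) {
  stopifnot(n_matches >= 0, n_matches <= n_overlaps)
  # Concordance: percentage of overlapping variants with matching genotypes
  concordance <- if (n_overlaps > 0) 100 * n_matches / n_overlaps else NA_real_
  # Score: concordance shrunk towards a / (a + b) when overlaps are few
  score <- 100 * (n_matches + a) / (n_overlaps + a + b)
  data.frame(overlaps = n_overlaps, matches = n_matches,
             concordance = concordance, similarity_score = score)
}

# Few overlaps: perfect concordance, but the score is pulled down
few <- similarity_sketch(5, 5)

# Many overlaps: the score approaches the concordance
many <- similarity_sketch(500, 500)
```

Under this assumed form, a comparison with few overlapping variants cannot reach a high score even when all genotypes match, which is the usual motivation for such a penalised score.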
Overlap and compare genotypes in many SNV profiles.
compare_many(many, one = NULL, a = 1, b = 5)
many |
SNV profiles to be compared (list of dataframes). |
one |
SNV profile to be compared to all others (dataframe). |
a |
Similarity score parameter a (integer). |
b |
Similarity score parameter b (integer). |
This is a function that compares all the combinations of the SNV profiles input to it, either in a one-to-many or many-to-many manner. It returns both a dataframe containing summary statistics for all unique combinations and a list of dataframes with all the performed comparisons, for easy re-use and downstream analyses of said comparisons.
A list of summary statistics and comparisons.
# Load test data
data(test_profile_1)
data(test_profile_2)

# Perform many-to-many comparisons
profiles <- list(test_profile_1, test_profile_2)
comparisons <- compare_many(profiles)

# View aggregate similarities
## Not run: comparisons[[1]]

# View data of first comparison
## Not run: head(comparisons[[2]][[1]])
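The difference between the one-to-many and many-to-many modes can be sketched by enumerating the sample pairs that would be compared. The base R code below is an illustration only: it assumes that "unique combinations" means unordered pairs with self-comparisons included, and the sample names are hypothetical.

```r
# Hypothetical sample names
samples <- c("sample1", "sample2", "sample3")

# Many-to-many: every unordered pair, self-comparisons included (assumption)
idx <- expand.grid(i = seq_along(samples), j = seq_along(samples))
idx <- idx[idx$i <= idx$j, ]
many_to_many <- data.frame(sample_1 = samples[idx$i],
                           sample_2 = samples[idx$j],
                           stringsAsFactors = FALSE)

# One-to-many: a single reference profile against each of the others
one_to_many <- data.frame(sample_1 = "reference",
                          sample_2 = samples,
                          stringsAsFactors = FALSE)
```

With n profiles this gives n * (n + 1) / 2 pairs in the many-to-many case and n pairs in the one-to-many case.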
Overlap and compare genotypes in two SNV profiles.
compare_profiles(profile_1, profile_2, mode = "intersection")
profile_1 |
The first SNV profile (GRanges object). |
profile_2 |
The second SNV profile (GRanges object). |
mode |
Merge profiles using "union" or "intersection" (character). |
This function finds the overlapping variants in two different SNV profiles (stored as GenomicRanges objects) and then compares the genotypes of those variants. It calls the "add_metadata" function twice in succession to merge the metadata of the two profiles (supplied as GRanges objects), collects the results in a dataframe, compares the genotypes of the overlapping variants using the "compare_genotypes" function and, finally, returns the dataframe with all variant overlaps and their similarity.
A dataframe.
# Load test data
data(test_profile_1)
data(test_profile_2)

# Compare the two profiles
comparison <- compare_profiles(test_profile_1, test_profile_2)
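The effect of the "mode" argument can be illustrated on two toy profiles using base R's `merge`. This is only a sketch of the union/intersection idea: the column names (`chr`, `pos`, `genotype`) and the data are hypothetical, and the package's own merging logic may differ.

```r
# Toy profiles (hypothetical columns; real profiles carry more metadata)
p1 <- data.frame(chr = c("1", "1", "2"), pos = c(100, 200, 300),
                 genotype = c("A/A", "A/G", "C/T"), stringsAsFactors = FALSE)
p2 <- data.frame(chr = c("1", "2", "3"), pos = c(100, 300, 400),
                 genotype = c("A/A", "C/C", "G/G"), stringsAsFactors = FALSE)

# "intersection": keep only positions present in both profiles
both <- merge(p1, p2, by = c("chr", "pos"), suffixes = c(".s1", ".s2"))

# "union": keep positions present in either profile (missing genotypes are NA)
either <- merge(p1, p2, by = c("chr", "pos"), all = TRUE,
                suffixes = c(".s1", ".s2"))

# Flag matching genotypes among the overlapping positions
both$match <- both$genotype.s1 == both$genotype.s2
```

Here the intersection holds the two shared positions, of which one has matching genotypes, while the union holds all four positions.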
Create an SNV profile from data in a VCF file.
create_profile(vcf_file, sample, min_depth = 10, filter_vc = TRUE, filter_mt = TRUE, filter_ns = TRUE, filter_gd = TRUE, filter_pd = FALSE)
vcf_file |
The VCF file from which the profile will be created (path). |
sample |
The sample in the VCF for which a profile will be created (character). |
min_depth |
Filter variants below this sequencing depth (integer). |
filter_vc |
Filter variants failing variant caller criteria (boolean). |
filter_mt |
Filter mitochondrial variants (boolean). |
filter_ns |
Filter non-standard chromosomes (boolean). |
filter_gd |
Filter duplicate variants at the gene-level (boolean). |
filter_pd |
Filter duplicate variants at the position-level (boolean). |
This function creates an SNV profile from a given VCF file by extracting the variants that pass the filtering criteria. Profile creation is performed to facilitate and accelerate the cell authentication procedures, which is especially relevant when more than one pairwise comparison will be performed on the same sample.
A data frame.
# Path to the test VCF file
vcf_file <- system.file("extdata", "test.vcf.gz", package = "seqCAT")

# Create SNV profiles
profile_1 <- create_profile(vcf_file, "sample1")
profile_1 <- create_profile(vcf_file, "sample1", min_depth = 15)
Create SNV profiles from all VCF files in a directory
create_profiles(vcf_dir, min_depth = 10, filter_vc = TRUE, filter_mt = TRUE, filter_ns = TRUE, filter_gd = TRUE, filter_pd = FALSE, pattern = NULL, recursive = FALSE)
vcf_dir |
The VCF directory from which the profiles will be created (path). |
min_depth |
Remove variants below this sequencing depth (integer). |
filter_vc |
Filter variants failing variant caller criteria (boolean). |
filter_mt |
Filter mitochondrial variants (boolean). |
filter_ns |
Filter non-standard chromosomes (boolean). |
filter_gd |
Filter duplicate variants at the gene-level (boolean). |
filter_pd |
Filter duplicate variants at the position-level (boolean). |
pattern |
Only create profiles for a subset of files corresponding to this pattern (character). |
recursive |
Find VCF files recursively in sub-directories as well (boolean). |
This function is a convenience wrapper for the 'create_profile' function, which will create SNV profiles for each and every VCF file in the provided directory. The file naming scheme used is '<sample>.vcf', which dictates each profile's sample name.
A list of data frames.
# Path to the test VCF directory
vcf_dir <- system.file("extdata", package = "seqCAT")

# Create SNV profiles
profiles <- create_profiles(vcf_dir, pattern = "test", recursive = TRUE)
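The '<sample>.vcf' naming scheme means that a sample name can be recovered from the file name alone. The base R sketch below shows one way to do so; the file paths are hypothetical, and treating a trailing '.gz' as part of the extension is an assumption.

```r
# Hypothetical file paths following the '<sample>.vcf' naming scheme
vcf_files <- c("data/HCT116.vcf", "data/HKE3.vcf.gz", "data/sub/RKO.vcf")

# Strip the directory and the '.vcf' or '.vcf.gz' extension
sample_names <- sub("\\.vcf(\\.gz)?$", "", basename(vcf_files))
```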
Filter duplicated variants.
filter_duplicates(data, filter_gd = TRUE, filter_pd = FALSE)
data |
The dataframe containing the variant data to be filtered. |
filter_gd |
Filter duplicate variants at the gene-level (boolean). |
filter_pd |
Filter duplicate variants at the position-level (boolean). |
This is a function for filtering duplicated variants either on the gene-level or the position-level.
A data frame containing the filtered variants.
# Load test profile
data(test_profile_1)

# Filter variants
filtered_gene <- filter_duplicates(test_profile_1)
filtered_position <- filter_duplicates(test_profile_1, filter_pd = TRUE)
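The distinction between the two duplicate levels can be sketched with base R's `duplicated`. The interpretation used here is an assumption: gene-level duplicates are taken to be rows sharing chromosome, position and gene (for example, one variant annotated on several transcripts), and position-level duplicates to be rows sharing chromosome and position. The column names and data are hypothetical, and the rule for which row is retained may differ in the package.

```r
# Toy variants: one position annotated on several transcripts and genes
variants <- data.frame(
  chr        = c("1", "1", "1", "2"),
  pos        = c(100, 100, 100, 300),
  gene       = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  transcript = c("T1", "T2", "T3", "T4"),
  stringsAsFactors = FALSE
)

# Gene-level: keep the first row per chr/pos/gene
by_gene <- variants[!duplicated(variants[, c("chr", "pos", "gene")]), ]

# Position-level: keep the first row per chr/pos
by_position <- variants[!duplicated(variants[, c("chr", "pos")]), ]
```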
Filter variants on several criteria.
filter_variants(data, min_depth = 10, filter_vc = FALSE, filter_mt = FALSE, filter_ns = FALSE)
data |
The dataframe containing the variant data to be filtered. |
min_depth |
Threshold for variant depth (integer). |
filter_vc |
Filter variants not passing filtering criteria (boolean). |
filter_mt |
Filter mitochondrial variants (boolean). |
filter_ns |
Filter non-standard chromosomes (boolean). |
This is a function for filtering SNV profiles on several criteria: sequencing depth, variant caller-specific filtering, mitochondrial variants and variants in non-standard chromosomes. Only filters by sequencing depth by default.
A data frame containing the filtered variants.
# Load test profile
data(test_profile_1)

# Filter variants
filtered <- filter_variants(test_profile_1, min_depth = 15)
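The four criteria can be pictured as logical masks over a variant table. The base R sketch below uses hypothetical column names (`chr`, `DP`, `FILTER`) and toy data, and assumes that variants at exactly the depth threshold are kept; it is not the package's implementation.

```r
# Toy variant table (hypothetical columns and values)
variants <- data.frame(
  chr    = c("1", "X", "MT", "GL000229.1", "2"),
  DP     = c(25, 8, 40, 30, 12),
  FILTER = c("PASS", "PASS", "PASS", "PASS", "LowQual"),
  stringsAsFactors = FALSE
)

# Chromosomes treated as standard in this sketch
standard <- c(as.character(1:22), "X", "Y", "MT")

keep_depth <- variants$DP >= 10          # sequencing depth
keep_vc    <- variants$FILTER == "PASS"  # variant caller criteria
keep_mt    <- variants$chr != "MT"       # mitochondrial variants
keep_ns    <- variants$chr %in% standard # non-standard chromosomes

# Depth only, mirroring the documented default behaviour
depth_only <- variants[keep_depth, ]

# All four criteria combined
all_filters <- variants[keep_depth & keep_vc & keep_mt & keep_ns, ]
```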
List all available samples in the COSMIC database
list_cosmic(file_path)
file_path |
The file containing COSMIC data (path). |
This function lists the available sample names in the provided COSMIC file (e.g. CosmicCLP_MutantExport.tsv.gz), and takes about half the time it takes to read the full file with the read_cosmic function, making it useful for just seeing if your particular sample is listed in COSMIC or not.
A vector of sample names
file <- system.file("extdata", "subset_CosmicCLP_MutantExport.tsv.gz", package = "seqCAT")
cosmic_samples <- list_cosmic(file)
List known variants present in SNV profiles
list_variants(profiles, known_variants)
profiles |
The SNV profiles to analyse (list) |
known_variants |
The known variants to look for (dataframe) |
This is a function for listing known variants present in SNV profiles. Input is a list of profiles and a dataframe of known variants, containing at least the genomic locations ("chr" and "pos"). Any additional columns will be retained.
A dataframe containing the known variant genotypes in each profile.
# Load test data
data(test_profile_1)
data(test_profile_2)

# Create some variants to analyse
known_variants <- data.frame(chr = 1, pos = 16229, gene = "DDX11L1")

# List the known variants in each profile
profiles <- list(test_profile_1, test_profile_2)
known_variants <- list_variants(profiles, known_variants)
Plot a heatmap of similarities from many-to-many SNV profile comparisons.
plot_heatmap(similarities, cluster = TRUE, annotate = TRUE, annotate_size = 9, legend = TRUE, legend_size = c(36, 8), limits = c(0, 50, 90, 100), text_size = 14, colour = "#1954A6")
similarities |
The long-format dataframe containing the data. |
cluster |
Cluster the samples based on similarity (boolean). |
annotate |
Annotate each cell with the score (boolean). |
annotate_size |
Text size of the annotations (numeric). |
legend |
Show a legend for the colour gradient (boolean). |
legend_size |
Height and width of the legend (vector of two integers). |
limits |
The limits for the colour gradient (vector of four integers). |
text_size |
Text size for axes, labels and legend (numeric). |
colour |
The main colour to use for the gradient (character). |
This function creates publication-ready plots of heatmaps for many-to-many sample comparisons, taking a long-format dataframe containing the summary statistics of each comparison as input.
A ggplot2 graphical object.
# Load test similarities
data(test_similarities)

# Plot a similarity heatmap
heatmap <- plot_heatmap(test_similarities)
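To clarify what "long-format" means here, the base R sketch below turns a long table with one row per comparison into the symmetric sample-by-sample matrix that a heatmap displays. The column names (`sample_1`, `sample_2`, `similarity_score`) and the scores are hypothetical; the function itself takes the long-format dataframe directly, so this conversion is for illustration only.

```r
# Long format: one row per comparison (hypothetical names and scores)
sims <- data.frame(
  sample_1         = c("s1", "s1", "s2"),
  sample_2         = c("s1", "s2", "s2"),
  similarity_score = c(100, 72.5, 100),
  stringsAsFactors = FALSE
)

# Wide format: a symmetric sample-by-sample matrix
samples <- union(sims$sample_1, sims$sample_2)
mat <- matrix(NA_real_, nrow = length(samples), ncol = length(samples),
              dimnames = list(samples, samples))
mat[cbind(sims$sample_1, sims$sample_2)] <- sims$similarity_score
mat[cbind(sims$sample_2, sims$sample_1)] <- sims$similarity_score
```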
Plot SNV impact distributions for a binary SNV profile comparison.
plot_impacts(comparison, legend = TRUE, annotate = TRUE, annotate_size = 9, text_size = 14, palette = c("#0D2D59", "#1954A6"))
comparison |
The SNV profile comparison to be plotted. |
legend |
Show the legend (boolean). |
annotate |
Annotate each category (boolean). |
annotate_size |
Text size for annotations (numeric). |
text_size |
Text size for axes, ticks and legend (numeric). |
palette |
Colour palette for filling of bars (character vector). |
This function creates publication-ready plots of the impact distribution from a binary dataset comparison across the matched/mismatched SNVs.
A ggplot2 graphical object.
# Load test comparison data
data(test_comparison)

# Plot the impact distribution
impacts <- plot_impacts(test_comparison)
Plot a genotype grid from a list of known variants
plot_variant_list(variant_list, legend = TRUE, legend_size = 22, text_size = 14, palette = c("#4e8ce4", "#a6c6f2", "#999999", "#cccccc"))
variant_list |
The data containing the variants (dataframe) |
legend |
Show a legend for the genotype colours (boolean) |
legend_size |
Size of the legend (numeric). |
text_size |
Text size for axes and legend (numeric). |
palette |
Nucleotide colour palette (4-element character vector) |
This function creates publication-ready plots from lists of known variants, taking a dataframe containing all the genotypes (in "A1/A2" format) for each sample (columns) and variant (row names).
A ggplot2 graphical object.
# Load test variant list
data(test_variant_list)

# Plot each variant's genotype per sample
genotype_grid <- plot_variant_list(test_variant_list)
Read COSMIC sample-specific mutational data.
read_cosmic(file_path, sample_name = NULL, primary_site = NULL)
file_path |
The COSMIC data file path (path). |
sample_name |
Subset the data on sample name (character). |
primary_site |
Subset the data on primary tumour site (character). |
This function reads the COSMIC data files (e.g. "CosmicCLP_MutantExport.tsv.gz") and returns a GRanges object with all the listed mutations for the specified sample, which can then be used in downstream profile comparisons. Only non-duplicated (gene-level) SNVs are included in COSMIC profiles.
A dataframe with COSMIC SNVs.
# Path to COSMIC test data
file <- system.file("extdata", "subset_CosmicCLP_MutantExport.tsv.gz", package = "seqCAT")

# Read COSMIC test data for the HCT116 cell line
cosmic_hct116 <- read_cosmic(file, "HCT116")
Read an SNV profile stored on disk.
read_profile(file, sample_name = NULL)
file |
The SNV profile to be read (path). |
sample_name |
Sample name to be added; overrides profile sample if it already exists (character). |
This is a function for reading SNV profiles created from VCF files that were stored on disk.
A data frame.
# Path to test data
profile <- system.file("extdata", "test_1.profile.txt.gz", package = "seqCAT")

# Read test profile
profile <- read_profile(profile)
Read SNV profiles in a directory.
read_profiles(profile_dir, pattern = ".profile.txt", sample_names = FALSE)
profile_dir |
The directory containing the profiles to be read (path). |
pattern |
Pattern for file name or extension to be read (character). |
sample_names |
Add sample name based on file name; overrides profile sample if it already exists (boolean). |
This is a wrapper function for reading multiple SNV profiles present in a directory (and its sub-directories in recursive mode).
A list of data frames.
# Path to test data
profile_dir <- system.file("extdata", package = "seqCAT")

# Read test profiles
profile_list <- read_profiles(profile_dir, pattern = "profile.txt")
The *seqCAT* package provides a number of functions for performing evaluation, characterisation and authentication of biological samples through analysis of high throughput sequencing data.
Overlapping and compared variants from "sample1" and "sample2" originating from the example.vcf file included in the inst/extdata directory, for use in unit tests.
data(test_comparison)
A dataframe with 51 rows and 39 columns:
chromosome
SNV position
total variant depth, sample 1
allelic depth, allele 1, sample 1
allelic depth, allele 2, sample 1
allele 1, sample 1
allele 2, sample 1
warnings from variant calling, sample 1
total variant depth, sample 2
allelic depth, allele 1, sample 2
allelic depth, allele 2, sample 2
allele 1, sample 2
allele 2, sample 2
warnings from variant calling, sample 2
name, sample 1
name, sample 2
status of genotype comparison
mutation ID
associated gene
ensembl gene ID
ensembl transcript ID
reference allele
alternative allele
putative variant impact
variant effect
transcript feature
transcript biotype
SNV profile in GRanges format from "sample1", originating from the test_profile_1.txt in the inst/extdata directory, for use in unit tests.
data(test_profile_1)
A GRanges object with 383 elements and 17 metadata columns:
mutation ID, if available
associated gene
ensembl gene ID
ensembl transcript ID
reference allele
alternative allele
putative variant impact
variant effect
transcript feature
transcript biotype
total variant depth
allelic depth, allele 1
allelic depth, allele 2
allele 1
allele 2
warnings from variant calling
sample name
SNV profile in GRanges format from "sample2", originating from the test_profile_2.txt in the inst/extdata directory, for use in unit tests.
data(test_profile_2)
A GRanges object with 382 elements and 17 metadata columns:
mutation ID, if available
associated gene
ensembl gene ID
ensembl transcript ID
reference allele
alternative allele
putative variant impact
variant effect
transcript feature
transcript biotype
total variant depth
allelic depth, allele 1
allelic depth, allele 2
allele 1
allele 2
warnings from variant calling
sample name
SNV profile in GRanges format from "sample3", originating from the test_profile_3.txt in the inst/extdata directory, for use in unit tests.
data(test_profile_3)
A GRanges object with 99 elements and 9 metadata columns:
mutation ID, if available
reference allele
alternative allele
total variant depth
allelic depth, allele 1
allelic depth, allele 2
allele 1
allele 2
sample name
Collated similarities of multiple sample comparisons from "sample1" and "sample2" from the example.vcf file, for use in unit tests.
data(test_similarities)
A dataframe with 3 rows and 6 columns:
name of sample 1
name of sample 2
the number of overlaps for the comparison
the number of matches for the comparison
the concordance of the profiles
the similarity score of the profiles
A variant list object from the 'list_variants' function, where the row names have been defined as "chr: pos (gene)" and the corresponding columns removed, for use in plotting.
data(test_variant_list)
A dataframe with 2 rows and 2 columns:
the genotypes of sample1
the genotypes of sample2
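The preparation described for this dataset (row names set to "chr: pos (gene)" and the corresponding columns removed) can be sketched in base R as follows. The input mimics a 'list_variants' result, but its second variant and all genotypes are hypothetical values chosen for illustration.

```r
# Hypothetical 'list_variants'-style result with location columns
variant_list <- data.frame(
  chr     = c(1L, 1L),
  pos     = c(16229L, 20000L),
  gene    = c("DDX11L1", "GENE_X"),
  sample1 = c("C/A", "T/T"),
  sample2 = c("C/C", "T/C"),
  stringsAsFactors = FALSE
)

# Set row names to "chr: pos (gene)"
rownames(variant_list) <- paste0(variant_list$chr, ": ", variant_list$pos,
                                 " (", variant_list$gene, ")")

# Remove the columns now encoded in the row names
location_columns <- c("chr", "pos", "gene")
variant_list <- variant_list[, !(names(variant_list) %in% location_columns),
                             drop = FALSE]
```

The result has one row per variant and one genotype column per sample, which is the shape that 'plot_variant_list' is documented to take.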
Write an SNV profile to a file for later re-use.
write_profile(profile, file)
profile |
The SNV profile to be written (data frame). |
file |
The file to write to (path). |
This is a function for writing SNV profiles (created from VCF files) to disk for later re-use. Several formats are allowed, including BED, GTF, GFF and normal text files, which are automatically recognised based on the supplied filename.
None; writes to disk only.
# Load test profile
data(test_profile_1)

# Write test profile to file
write_profile(test_profile_1, "test_profile_1.txt")
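Recognising the format from the file name can be pictured as a lookup on the file extension. The base R sketch below is a guess at the general idea rather than the package's actual logic; the helper name `detect_format` is hypothetical, and falling back to plain text for any other extension is an assumption.

```r
# Hypothetical helper, NOT part of seqCAT: map a file name to a format label
detect_format <- function(file) {
  ext <- toupper(tools::file_ext(file))
  # Anything that is not BED, GTF or GFF is treated as plain text (assumption)
  if (ext %in% c("BED", "GTF", "GFF")) ext else "TXT"
}

formats <- vapply(c("profile.bed", "profile.gtf", "test_profile_1.txt"),
                  detect_format, character(1), USE.NAMES = FALSE)
```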
Write several SNV profiles to file for later re-use.
write_profiles(profile_list, format = "TXT", directory = "./")
profile_list |
The SNV profiles to be written (list). |
format |
The desired file format (character). |
directory |
The directory to write to (path). |
This is a wrapper function for writing multiple SNV profiles (supplied as a list) to files in the given directory, using the specified format.
None; writes to disk only.
# Load test profiles
data(test_profile_1)
data(test_profile_2)
profiles <- list(test_profile_1, test_profile_2)

# Write test profiles to file
write_profiles(profiles, format = "TXT", directory = "./")