Title: | Identification of cancer cell lines based on their weighted mutational/ variational fingerprint |
---|---|
Description: | 'Uniquorn' enables users to identify cancer cell lines. Cancer cell line misidentification and cross-contamination reprents a significant challenge for cancer researchers. The identification is vital and in the frame of this package based on the locations/ loci of somatic and germline mutations/ variations. The input format is vcf/ vcf.gz and the files have to contain a single cancer cell line sample (i.e. a single member/genotype/gt column in the vcf file). |
Authors: | Raik Otto |
Maintainer: | 'Raik Otto' <[email protected]> |
License: | Artistic-2.0 |
Version: | 2.27.0 |
Built: | 2024-11-19 04:42:41 UTC |
Source: | https://github.com/bioc/Uniquorn |
add_custom_vcf_to_database This function adds the variants of parsed custom CCLs to a monet DB instance
add_custom_vcf_to_database( vcf_input_files, ref_gen = "GRCH37", library_name = "CUSTOM", n_threads = 1, test_mode = FALSE )
add_custom_vcf_to_database( vcf_input_files, ref_gen = "GRCH37", library_name = "CUSTOM", n_threads = 1, test_mode = FALSE )
vcf_input_files |
a character vector containing the input vcf files. This may be one or many vcf files. |
ref_gen |
a character string specifying the reference genome version.
All training sets are associated with a reference genome version.
Default is |
library_name |
a character string giving the name of the library to add the
cancer cell lines to. Default is |
n_threads |
an integer specifying the number of threads to be used. |
test_mode |
Is this a test? Just for internal use |
Message wheather the adding was successful
HT29_vcf_file = system.file("extdata/HT29_TEST.vcf", package = "Uniquorn"); add_custom_vcf_to_database( vcf_input_files = HT29_vcf_file, library_name = "CELLMINER", ref_gen = "GRCH37", n_threads = 1, test_mode = TRUE )
HT29_vcf_file = system.file("extdata/HT29_TEST.vcf", package = "Uniquorn"); add_custom_vcf_to_database( vcf_input_files = HT29_vcf_file, library_name = "CELLMINER", ref_gen = "GRCH37", n_threads = 1, test_mode = TRUE )
add_missing_cls
add_missing_cls(res_table, dif_cls)
add_missing_cls(res_table, dif_cls)
res_table |
Table that contains the identification results |
dif_cls |
Missing CLs |
Results table with added missing cls
A hypergeometric distribution-assumption allows to calculate the p-values for a significant or non-significant overlap in this function
add_p_q_values_statistics( g_query, match_t, p_value, ref_gen, minimum_matching_mutations, top_hits_per_library )
add_p_q_values_statistics( g_query, match_t, p_value, ref_gen, minimum_matching_mutations, top_hits_per_library )
g_query |
IRanges object that contains the query variants |
match_t |
A table that contains the nubmber of matching variants |
p_value |
Threshold for the significance p-value |
ref_gen |
Reference genome version |
minimum_matching_mutations |
Manual lower amount of matching mutations require for a significant match between a query and a reference |
top_hits_per_library |
limits significant similarities to the first n hits |
add_p_q_values_statistics
Calculates the p-values
R table with a statistic
Add penalty statistics to results
add_penalty_statistics(match_t, minimum_matching_mutations)
add_penalty_statistics(match_t, minimum_matching_mutations)
match_t |
object that contains the matching variants |
minimum_matching_mutations |
a numerical giving the minimum amount of mutations that has to match between query and training sample for a positive prediction |
The updated statistics
Creates BED files from the found and not found annotated mutations
create_bed_file( match_t, vcf_fingerprint, output_file, ref_gen, manual_identifier )
create_bed_file( match_t, vcf_fingerprint, output_file, ref_gen, manual_identifier )
match_t |
R table which contains the mutations from the training database for the cancer cell lines |
vcf_fingerprint |
contains the mutations that are present in the query cancer cell line's vcf file |
output_file |
Path to output file |
ref_gen |
Reference genome version |
manual_identifier |
Manually enter a vector of CL name(s) whose bed files should be created, independently from them passing the detection threshold |
Returns a message which indicates if the BED file creation has succeeded
Identifies a cancer cell lines contained in a vcf file based on the pattern (start & length) of all contained mutations/ variations.
identify_vcf_file( vcf_file, output_file, ref_gen, minimum_matching_mutations, mutational_weight_inclusion_threshold, write_xls, output_bed_file, top_hits_per_library, manual_identifier, verbose, p_value, confidence_score, n_threads, write_results )
identify_vcf_file( vcf_file, output_file, ref_gen, minimum_matching_mutations, mutational_weight_inclusion_threshold, write_xls, output_bed_file, top_hits_per_library, manual_identifier, verbose, p_value, confidence_score, n_threads, write_results )
vcf_file |
Input vcf file. Only one sample column allowed. |
output_file |
Path of the output file. If blank, autogenerated as name of input file plus '_uniquorn_ident.tab' suffix. |
ref_gen |
Reference genome version. All training sets are associated with a reference genome version. Default: GRCH37 |
minimum_matching_mutations |
The minimum amount of mutations that has to match between query and training sample for a positive prediction |
mutational_weight_inclusion_threshold |
Include only mutations with a weight of at least x. Range: 0.0 to 1.0. 1= unique to CL. ~0 = found in many CL samples. |
write_xls |
Create identification results additionally as xls file for easier reading |
output_bed_file |
If BED files for IGV visualization should be created for the Cancer Cell lines that pass the threshold |
top_hits_per_library |
Limit the number of significant similarities per library to n (default 3) many hits. Is particularrly used in contexts when heterogeneous query and reference CCLs are being compared. |
manual_identifier |
Manually enter a vector of CL name(s) whose bed files should be created, independently from them passing the detection threshold |
verbose |
Print additional information |
p_value |
Required p-value for identification. Note that if you set the confidence score, the confidence score overrides the p-value |
confidence_score |
Cutoff for positive prediction between 0 and 100. Calculated by transforming the p-value by -1 * log(p-value) Note that if you set the confidence score, the confidence score overrides the p-value |
n_threads |
Number of threads to be used |
write_results |
Write identification results to file |
identify_vcf_file
parses the vcf file and predicts
the identity of the sample
R table with a statistic of the identification result
HT29_vcf_file = system.file("extdata/HT29.vcf", package = "Uniquorn"); identification = identify_vcf_file( vcf_file = HT29_vcf_file, verbose = FALSE, write_results = FALSE )
HT29_vcf_file = system.file("extdata/HT29.vcf", package = "Uniquorn"); identification = identify_vcf_file( vcf_file = HT29_vcf_file, verbose = FALSE, write_results = FALSE )
Initiate the analysis Output basic information
init_and_load_identification( verbose, ref_gen, vcf_file, output_dir )
init_and_load_identification( verbose, ref_gen, vcf_file, output_dir )
verbose |
Print additional information |
ref_gen |
Reference genome version. All training sets are associated with a reference genome version. Default: GRCH37 |
vcf_file |
Path to vcf_file |
output_dir |
Output directory for identification results |
init_and_load_identification
parses vcf file and output
basic information
Three file path instances and the fingerprint
Parses data into r list variable
initiate_canonical_databases( cosmic_file = "CosmicCLP_MutantExport.tsv.gz", ccle_file = "CCLE_mutations.csv", ccle_sample_file = "sample_info.csv", ref_gen = "GRCH38" )
initiate_canonical_databases( cosmic_file = "CosmicCLP_MutantExport.tsv.gz", ccle_file = "CCLE_mutations.csv", ccle_sample_file = "sample_info.csv", ref_gen = "GRCH38" )
cosmic_file |
The path to the Cosmic CLP file. The Cosmic file can be obtained from "https://cancer.sanger.ac.uk/cell_lines/download" and should be labeled "CosmicCLP_MutantExport.tsv.gz". Ensure that the right reference genome is used |
ccle_file |
The path to the ccle DNA genotype data file. It should be labeled "CCLE_mutations.csv". Ensure that the right reference genome is used |
ccle_sample_file |
The path to the CCLE sample file. It should be labeled "sample_info.csv" containing both the DepMap ID and corresponding cell line name. |
ref_gen |
Reference genome version |
Returns message if parsing process has succeeded
initiate_canonical_databases( cosmic_file = "CosmicCLP_MutantExport.tsv.gz", ccle_file = "CCLE_mutations.csv", ccle_sample_file = "sample_info.csv", ref_gen = "GRCH38" )
initiate_canonical_databases( cosmic_file = "CosmicCLP_MutantExport.tsv.gz", ccle_file = "CCLE_mutations.csv", ccle_sample_file = "sample_info.csv", ref_gen = "GRCH38" )
Matches query ccl to the database
match_query_ccl_to_database( g_query, ref_gen = "GRCH37", library_name, mutational_weight_inclusion_threshold )
match_query_ccl_to_database( g_query, ref_gen = "GRCH37", library_name, mutational_weight_inclusion_threshold )
g_query |
IRanges object that contains the variants |
ref_gen |
Reference genome version. All training sets are associated with a reference genome version. Default: GRCH37 |
library_name |
a character string giving the name of the library |
mutational_weight_inclusion_threshold |
a numerical giving the lower bound for mutational weight to be included |
The R Table sim_list which contains the CoSMIC CLP fingerprints
Intern utility function. Filters the parsed VCF file for all informations except for the start and length of variations/mutations.
parse_vcf_file( vcf_file, ref_gen, library_name )
parse_vcf_file( vcf_file, ref_gen, library_name )
vcf_file |
character string giving the path to the vcf file on the operating system. |
ref_gen |
Reference genome version |
library_name |
Name of the reference library |
Loci-based DNA-mutational fingerprint of the cancer cell line as found in the input VCF file.
parse_vcf_query_into_db This function adds the variants of parsed custom CCLs to a monet DB instance
parse_vcf_query_into_db( g_query, ref_gen = "GRCH37", library_name, test_mode = FALSE )
parse_vcf_query_into_db( g_query, ref_gen = "GRCH37", library_name, test_mode = FALSE )
g_query |
a GenomicRanges object |
ref_gen |
a character string specifying the reference genome version.
All training sets are associated with a reference genome version.
Default is |
library_name |
a character string giving the name of the library to add
the cancer cell lines to. Default is |
test_mode |
Is this a test? Just for internal use |
Message wheather the adding was successful
This function procides information on the reference library names
read_library_names(ref_gen)
read_library_names(ref_gen)
ref_gen |
a character vector specifying the reference genome
version. All training sets are associated with a reference genome
version. Default is |
Returns a character vector of the contained libraries
read_library_names(ref_gen = "GRCH37")
read_library_names(ref_gen = "GRCH37")
Read the GRange object for a specific library
read_mutation_grange_objects( library_name, mutational_weight_inclusion_threshold, ref_gen, test_mode )
read_mutation_grange_objects( library_name, mutational_weight_inclusion_threshold, ref_gen, test_mode )
library_name |
a character string giving the name of the library |
mutational_weight_inclusion_threshold |
a numerical giving the lower bound for mutational weight to be included |
ref_gen |
Reference genome version. All training sets are associated with a reference genome version. Default: GRCH37 |
test_mode |
Is this a test? Just for internal use |
The R Table sim_list which contains the CoSMIC CLP fingerprints
This function removes a cancer cell line training fingerprint (VCF file)
from the database. The names of all training sets can
be seen by using the function show_contained_cls
.
remove_ccls_from_database(ccl_names, ref_gen = "GRCH37", library_name, test_mode = FALSE)
remove_ccls_from_database(ccl_names, ref_gen = "GRCH37", library_name, test_mode = FALSE)
ccl_names |
A character vector giving the names of the cancer cell line identifiers to be removed. Can be one or many |
ref_gen |
A character vector specifying the reference genome version.
All training sets are associated with a reference genome version.
Default is |
library_name |
Name of the library from which ccls are to be removed |
test_mode |
Signifies if this is a test run |
Message that indicates whether the removal was succesful.
remove_ccls_from_database( ccl_names = "HT29", ref_gen = "GRCH37", library_name = "CELLMINER", test_mode = TRUE )
remove_ccls_from_database( ccl_names = "HT29", ref_gen = "GRCH37", library_name = "CELLMINER", test_mode = TRUE )
This function removes a entire library from the database by removing all associated cancer cell line fingerprints from the database.
remove_library_from_database(library, ref_gen = "GRCH37", test_mode = FALSE)
remove_library_from_database(library, ref_gen = "GRCH37", test_mode = FALSE)
library |
a character vector giving the names of the library to be removed. |
ref_gen |
a character vector specifying the reference genome version.
All training sets are associated with a reference genome version.
Default is |
test_mode |
is this a test? Just for internal use. |
Message that indicates whether the removal was succesful.
remove_library_from_database(library = "CELLMINER", ref_gen = "GRCH37", test_mode = TRUE)
remove_library_from_database(library = "CELLMINER", ref_gen = "GRCH37", test_mode = TRUE)
This function displays the names, amount of mutations and the overall weight of the mutations of all contained cancer cell line fingerprints for a chosen reference genome and optional library.
show_contained_ccls(ref_gen, verbose)
show_contained_ccls(ref_gen, verbose)
ref_gen |
a character vector specifying the reference genome version.
All training sets are associated with a reference genome version.
Default is |
verbose |
Should DB informations be printed? |
R table which contains identifiers of all cancer cell line samples which match the specified parameters (reference genome and library).
## Show all contained cancer cell lines for reference GRCH37: show_contained_ccls(ref_gen = "GRCH37", verbose = TRUE)
## Show all contained cancer cell lines for reference GRCH37: show_contained_ccls(ref_gen = "GRCH37", verbose = TRUE)
This function shows all mutations present in the database for a selected cancer cell line and reference genome.
show_contained_variants_for_ccl( name_ccl, ref_gen, library_name, mutational_weight_inclusion_threshold )
show_contained_variants_for_ccl( name_ccl, ref_gen, library_name, mutational_weight_inclusion_threshold )
name_ccl |
a character vector giving the identifier of the cancer cell line for which mutations will be shown. |
ref_gen |
a character vector specifying the reference
genome version. All training sets are associated with a
reference genome version. Default is |
library_name |
Name of the reference library |
mutational_weight_inclusion_threshold |
Include only mutations with a weight of at least x. Range: 0.0 to 1.0. 1= unique to CL. ~0 = found in many CCL samples. |
GenomicRanges object that contains the ccl's variants
## Show all mutations for Cancer Cell Line 'SK_OV_3' show_contained_variants_for_ccl( name_ccl = "SK_OV_3", ref_gen = "GRCH37", library_name = "CELLMINER", mutational_weight_inclusion_threshold = 0 )
## Show all mutations for Cancer Cell Line 'SK_OV_3' show_contained_variants_for_ccl( name_ccl = "SK_OV_3", ref_gen = "GRCH37", library_name = "CELLMINER", mutational_weight_inclusion_threshold = 0 )
This function shows all variants contained in a reference library for a given inclusion weight. Default inclusion weight is 0 (all variants).
show_contained_variants_in_library( ref_gen, library_name, mutational_weight_inclusion_threshold )
show_contained_variants_in_library( ref_gen, library_name, mutational_weight_inclusion_threshold )
ref_gen |
a character vector specifying the reference genome
version. All training sets are associated with a reference genome version.
Default is |
library_name |
Name of the reference library. |
mutational_weight_inclusion_threshold |
Include only mutations with a weight of at least x. Range: 0.0 to 1.0. 1 = unique to CL. ~0 = found in many CL samples. |
Returns a GenomicRanges object that contains the variants
## Show all variants contained in reference library CELLMINER show_contained_variants_in_library( ref_gen = "GRCH37", library_name = "CELLMINER", mutational_weight_inclusion_threshold = 0 )
## Show all variants contained in reference library CELLMINER show_contained_variants_in_library( ref_gen = "GRCH37", library_name = "CELLMINER", mutational_weight_inclusion_threshold = 0 )
This function displays all cancer cell lines in the database which contain a specified variant. Utilizes closed interval coordinates.
show_which_ccls_contain_variant( start, end, chromosome, ref_gen, library_name, mutational_weight_inclusion_threshold )
show_which_ccls_contain_variant( start, end, chromosome, ref_gen, library_name, mutational_weight_inclusion_threshold )
start |
Start coordinate |
end |
Stop coordinate |
chromosome |
Chromosome, 'chr' prefixes are ignored |
ref_gen |
a character vector specifying the reference genome
version. All training sets are associated with a reference genome
version. Default is |
library_name |
Name of the reference library |
mutational_weight_inclusion_threshold |
Include only mutations with a weight of at least x. Range: 0.0 to 1.0. 1= unique to CL. ~0 = found in many CCL samples. |
Returns a GenomicRanges object that contains the variant if present. Member ccls can be found in the $Member_ccl vector
show_which_ccls_contain_variant( start = 92030762, end = 92030762, chromosome = 8, ref_gen = "GRCH37", library_name = "CELLMINER", mutational_weight_inclusion_threshold = 0 )
show_which_ccls_contain_variant( start = 92030762, end = 92030762, chromosome = 8, ref_gen = "GRCH37", library_name = "CELLMINER", mutational_weight_inclusion_threshold = 0 )