Title: | Analyze gene therapy vector insertion sites data identified from genomics next generation sequencing reads for clonal tracking studies |
---|---|
Description: | In gene therapy, stem cells are modified using viral vectors to deliver the therapeutic transgene and replace functional properties since the genetic modification is stable and inherited in all cell progeny. The retrieval and mapping of the sequences flanking the virus-host DNA junctions allows the identification of insertion sites (IS), essential for monitoring the evolution of genetically modified cells in vivo. A comprehensive toolkit for the analysis of IS is required to foster clonal tracking studies and support the assessment of safety and long-term efficacy in vivo. This package is aimed at (1) supporting automation of IS workflow, (2) performing basic and advanced analyses for IS tracking (clonal abundance, clonal expansions and statistics for insertional mutagenesis, etc.), (3) providing basic biology insights of transduced stem cells in vivo. |
Authors: | Francesco Gazzo [cre], Giulia Pais [aut] , Andrea Calabria [aut], Giulio Spinozzi [aut] |
Maintainer: | Francesco Gazzo <[email protected]> |
License: | CC BY 4.0 |
Version: | 1.17.1 |
Built: | 2025-01-05 03:49:34 UTC |
Source: | https://github.com/bioc/ISAnalytics |
Groups metadata by the specified grouping keys and returns a
summary of info for each group. For more details on how to use this function:
vignette("workflow_start", package = "ISAnalytics")
aggregate_metadata( association_file, grouping_keys = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), aggregating_functions = default_meta_agg(), import_stats = lifecycle::deprecated() )
association_file |
The imported association file (via import_association_file) |
grouping_keys |
A character vector of column names to form a grouping operation |
aggregating_functions |
A data frame containing specifications of the functions to be applied to columns in the association file during aggregation. It defaults to default_meta_agg. The structure of this data frame should be maintained if the user wishes to change the defaults. |
import_stats |
The import of VISPA2 stats has been moved to its dedicated function, see import_Vispa2_stats. |
An aggregated data frame
Other Data cleaning and pre-processing: aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("association_file", package = "ISAnalytics")
aggreg_meta <- aggregate_metadata(
    association_file = association_file
)
head(aggreg_meta)
Performs aggregation on values contained in the integration matrices based
on the key and the specified lambda. For more details on how to use this
function:
vignette("workflow_start", package = "ISAnalytics")
aggregate_values_by_key( x, association_file, value_cols = "Value", key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), lambda = list(sum = ~sum(.x, na.rm = TRUE)), group = c(mandatory_IS_vars(), annotation_IS_vars()), join_af_by = "CompleteAmplificationID" )
x |
A single integration matrix or a list of imported integration matrices |
association_file |
The imported association file |
value_cols |
A character vector containing the names of the columns to apply the given lambdas. Must be numeric or integer columns. |
key |
A string or a character vector with column names of the association file to take as key |
lambda |
A named list of functions or purrr-style lambdas. See details section. |
group |
Other variables to include in the grouping besides |
join_af_by |
A character vector representing the joining key between the matrix and the metadata. Useful to re-aggregate already aggregated matrices. |
The lambda parameter should always contain a named list of either functions or purrr-style lambdas. It is also possible to specify the namespace of the function in both ways, for example:
lambda = list(sum = sum, desc = psych::describe)
Using purrr-style lambdas allows specifying arguments for the functions, keeping in mind that the first parameter should always be .x:
lambda = list(sum = ~sum(.x, na.rm = TRUE))
It is also possible to use custom user-defined functions, keeping in mind that the symbol will be evaluated in the calling environment, for example if the function is called in the global environment and lambda contains "foo" as a function, "foo" will be evaluated in the global environment.
foo <- function(x) {
    sum(x)
}
lambda = list(sum = ~sum(.x, na.rm = TRUE), foo = foo)
# Or with lambda notation
lambda = list(sum = ~sum(.x, na.rm = TRUE), foo = ~foo(.x))
Functions passed in the lambda parameter must respect a few constraints to work properly, and it is the user's responsibility to ensure this.
Functions have to accept a numeric or integer vector as input.
Functions should return a single value or a list/data frame: if a list or a data frame is returned, all its columns will be added to the final data frame.
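As an illustration of these constraints, the following base R sketch defines two hypothetical user functions (`total` and `range_info`, names invented for this example) and inspects what they return for a numeric vector. It does not call ISAnalytics itself.

```r
# Hypothetical user-defined functions (names are illustrative only)
total <- function(x) sum(x, na.rm = TRUE)  # returns a single value
range_info <- function(x) {                # returns a named list
  list(min = min(x, na.rm = TRUE), max = max(x, na.rm = TRUE))
}

values <- c(543, 42, NA, 8)

single <- total(values)
multi <- range_info(values)

# A single value maps to one output column,
# a list maps to one output column per element
length(single)
names(multi)
```

Following the pattern documented above, such functions could then be supplied as `lambda = list(total = total, range = range_info)`.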
A list of data frames or a single data frame aggregated according to the specified arguments
Other Data cleaning and pre-processing: aggregate_metadata(), compute_near_integrations(), default_meta_agg(), outlier_filter(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
head(aggreg)
This helper function checks if each individual integration site, identified by the mandatory_IS_vars(), has been annotated with two or more distinct gene symbols.
annotation_issues(matrix)
matrix |
Either a single matrix or a list of matrices, ideally obtained
via |
Either NULL if no issues were detected or 1 or more data frames with genomic coordinates of the IS and the number of distinct genes associated
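The idea behind this check can be sketched in base R on a toy table. This is not the package's implementation: the data are made up, and `GeneName` is an assumed name for the gene symbol column.

```r
# Toy integration matrix; GeneName is an assumed gene symbol column name
im <- data.frame(
  chr = c("1", "1", "2", "2"),
  integration_locus = c(45324, 45324, 52423, 52423),
  strand = c("+", "+", "-", "-"),
  GeneName = c("GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  stringsAsFactors = FALSE
)

# Count the distinct gene symbols attached to each integration site
distinct_genes <- aggregate(
  x = im["GeneName"],
  by = im[c("chr", "integration_locus", "strand")],
  FUN = function(g) length(unique(g))
)
names(distinct_genes)[names(distinct_genes) == "GeneName"] <- "distinct_genes"

# Keep only the sites annotated with two or more distinct symbols
issues <- distinct_genes[distinct_genes$distinct_genes >= 2, ]
issues
```

Here only the site on chromosome 1 is reported, since the site on chromosome 2 carries the same symbol twice.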
Other Import functions helpers: date_formats(), default_af_transform(), default_iss_file_prefixes(), matching_options(), quantification_types()
data("integration_matrices", package = "ISAnalytics")
annotation_issues(integration_matrices)
This function is particularly useful when a sparse matrix structure is needed by a specific function (mainly from other packages).
as_sparse_matrix( x, single_value_col = "Value", fragmentEstimate = "fragmentEstimate", seqCount = "seqCount", barcodeCount = "barcodeCount", cellCount = "cellCount", ShsCount = "ShsCount", key = pcr_id_column() )
x |
A single tidy integration matrix or a list of integration matrices. Supports also multi-quantification matrices obtained via comparison_matrix |
single_value_col |
Name of the column containing the values when providing a single-quantification matrix |
fragmentEstimate |
For multi-quantification matrix support: the name of the fragment estimate values column |
seqCount |
For multi-quantification matrix support: the name of the sequence count values column |
barcodeCount |
For multi-quantification matrix support: the name of the barcode count values column |
cellCount |
For multi-quantification matrix support: the name of the cell count values column |
ShsCount |
For multi-quantification matrix support: the name of the Shs Count values column |
key |
The name of the sample identifier fields (for aggregated matrices can be a vector with more than 1 element) |
Depending on input, 2 possible outputs:
A single sparse matrix (data frame) if input is a single quantification matrix
A list of sparse matrices divided by quantification if input is a single multi-quantification matrix or a list of matrices
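To make the tidy versus sparse distinction concrete, here is a minimal base R sketch of the reshaping, using toy values and the column names that appear elsewhere in this manual. It relies on `reshape()` rather than the package function, and missing combinations show up as `NA` here, which is an assumption about presentation rather than a statement about what as_sparse_matrix returns.

```r
# Tidy (long) matrix: one row per integration site and sample
tidy <- data.frame(
  chr = c("1", "2", "6", "X"),
  integration_locus = c(45324, 52423, 54623, 12314),
  strand = c("+", "-", "-", "+"),
  CompleteAmplificationID = c("ID1", "ID1", "ID2", "ID3"),
  Value = c(543, 42, 67, 8),
  stringsAsFactors = FALSE
)

# Sparse (wide) layout: one row per integration site, one column per sample
wide <- reshape(
  tidy,
  idvar = c("chr", "integration_locus", "strand"),
  timevar = "CompleteAmplificationID",
  v.names = "Value",
  direction = "wide"
)

# Strip the "Value." prefix that reshape() adds to the sample columns
names(wide) <- sub("^Value\\.", "", names(wide))
wide
```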
Other Utilities: comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
data("integration_matrices", package = "ISAnalytics")
sparse <- as_sparse_matrix(integration_matrices)
This file is a simple example of an association file. Use it as a reference to properly fill out yours. To generate an empty association file to fill in, see the generate_blank_association_file() function.
data("association_file")
An object of class data.table (inherits from data.frame) with 53 rows and 83 columns.
The data was obtained manually by simulating real research data.
generate_blank_association_file
A character vector containing all the names of the currently supported outliers tests that can be called in the function outlier_filter.
available_outlier_tests()
A character vector
available_outlier_tests()
Contains all information associated with critical tags used in the dynamic vars system. To know more see vignette("workflow_start", package="ISAnalytics").
available_tags()
A data frame
available_tags()
A default table with info relative to different blood lineages associated with cell markers that can be supplied as a parameter to HSC_population_size_estimate
blood_lineages_default()
A data frame
blood_lineages_default()
For this functionality the suggested package circlize is required. Please note that this function is a simple wrapper of basic circlize functions; for an in-depth explanation of how the functions work and of additional arguments, please refer to the official documentation, Circular Visualization in R.
circos_genomic_density( data, gene_labels = NULL, label_col = NULL, cytoband_specie = "hg19", track_colors = "navyblue", grDevice = c("png", "pdf", "svg", "jpeg", "bmp", "tiff", "default"), file_path = getwd(), ... )
data |
Either a single integration matrix or a list of integration matrices. If a list is provided, a separate density track for each data frame is plotted. |
gene_labels |
Either |
label_col |
Numeric index of the column of |
cytoband_specie |
Species for initializing the cytoband |
track_colors |
Colors to give to density tracks. If more than one
integration matrix is provided as |
grDevice |
The graphical device where the plot should be traced.
|
file_path |
If a device other than |
... |
Additional named arguments to pass on to chosen device,
|
If genomic labels should be plotted alongside genomic density tracks, the user should provide them as a simple data frame in standard bed format, namely chr, start, end plus a column containing the labels.
NOTE: if the user decides to plot on the default device (viewer in RStudio), they must ensure there is enough space for all elements to be plotted, otherwise an error message is thrown.
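A label table in the layout described above might look like the following sketch. The gene names and coordinates are made up, the "chr1"-style chromosome naming is an assumption, and so is the reading of label_col as the index of the label column in this table.

```r
# Made-up labels in bed-like layout: chr, start, end, plus a label column
gene_labels <- data.frame(
  chr = c("chr1", "chr7"),
  start = c(1000000, 5500000),
  end = c(1050000, 5560000),
  label = c("GENE_A", "GENE_B"),
  stringsAsFactors = FALSE
)

# Numeric index of the column holding the labels (assumed meaning of label_col)
label_col <- which(names(gene_labels) == "label")
label_col
```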
NULL
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), fisher_scatterplot(), integration_alluvial_plot(), sharing_heatmap(), sharing_venn(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
# One data frame per subject, each plotted as a separate density track
by_subj <- aggreg |>
    dplyr::group_by(.data$SubjectID) |>
    dplyr::group_split()
circos_genomic_density(by_subj,
    track_colors = c("navyblue", "gold"),
    grDevice = "default", track.height = 0.1
)
Statistical approach for the validation of common insertion sites significance based on the comparison of the integration frequency at the CIS gene with respect to other genes contained in the surrounding genomic regions. For more details please refer to this paper: https://ashpublications.org/blood/article/117/20/5332/21206/Lentiviral-vector-common-integration-sites-in
CIS_grubbs( x, genomic_annotation_file = "hg19", grubbs_flanking_gene_bp = 1e+05, threshold_alpha = 0.05, by = NULL, return_missing_as_df = TRUE, results_as_list = TRUE )
x |
An integration matrix, must include the |
genomic_annotation_file |
Database file for gene annotation, see details. |
grubbs_flanking_gene_bp |
Number of base pairs flanking a gene |
threshold_alpha |
Significance threshold |
by |
Either |
return_missing_as_df |
Returns those genes present in the input df but not in the refgenes as a data frame? |
results_as_list |
If |
A data frame containing gene annotations for the specific genome.
From version 1.5.4 the argument genomic_annotation_file accepts only data frames or package-provided defaults. The user is responsible for importing the appropriate tabular files if customization is needed. The annotations for the human genome (hg19) and murine genome (mm9) are already included in this package: to use one of them just set the argument genomic_annotation_file to either "hg19" or "mm9".
If for any reason the user is performing an analysis on another genome, this file needs to be changed respecting the UCSC Genome Browser format, meaning the input file headers should include:
name2, chrom, strand, min_txStart, max_txEnd, minmax_TxLen, average_TxLen, name, min_cdsStart, max_cdsEnd, minmax_CdsLen, average_CdsLen
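Before passing a custom annotation data frame, it may help to verify that it carries the headers listed above. The helper below is a hypothetical convenience written for this sketch, not a function of the package, and the toy table contains made-up values.

```r
# Headers required for a custom annotation table, as listed above
required_cols <- c(
  "name2", "chrom", "strand", "min_txStart", "max_txEnd", "minmax_TxLen",
  "average_TxLen", "name", "min_cdsStart", "max_cdsEnd", "minmax_CdsLen",
  "average_CdsLen"
)

# Hypothetical helper: returns the required headers missing from `df`
missing_annotation_cols <- function(df) {
  setdiff(required_cols, names(df))
}

# Toy, deliberately incomplete annotation table with made-up values
custom_refgenes <- data.frame(
  name2 = "GENE_A", chrom = "chr1", strand = "+",
  min_txStart = 1000, max_txEnd = 5000,
  stringsAsFactors = FALSE
)

missing_cols <- missing_annotation_cols(custom_refgenes)
missing_cols
```

An empty result would indicate that all the listed headers are present.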
A data frame
The function will explicitly check for the presence of these tags:
chromosome
locus
is_strand
gene_symbol
gene_strand
Other Analysis functions: HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
cis <- CIS_grubbs(integration_matrices)
cis
Computes common insertion sites and Grubbs test for each separate group and separating different time points among the same group. The logic applied is the same as the function CIS_grubbs().
CIS_grubbs_overtime( x, genomic_annotation_file = "hg19", grubbs_flanking_gene_bp = 1e+05, threshold_alpha = 0.05, group = "SubjectID", timepoint_col = "TimePoint", as_df = TRUE, return_missing_as_df = TRUE, max_workers = NULL )
x |
An integration matrix, must include the |
genomic_annotation_file |
Database file for gene annotation, see details. |
grubbs_flanking_gene_bp |
Number of base pairs flanking a gene |
threshold_alpha |
Significance threshold |
group |
A character vector of column names that identifies a group. Each group must contain one or more time points. |
timepoint_col |
What is the name of the column containing time points? |
as_df |
Choose the result format: if |
return_missing_as_df |
Returns those genes present in the input df but not in the refgenes as a data frame? |
max_workers |
Maximum number of parallel workers. If |
A data frame containing gene annotations for the specific genome.
From version 1.5.4 the argument genomic_annotation_file accepts only data frames or package-provided defaults. The user is responsible for importing the appropriate tabular files if customization is needed. The annotations for the human genome (hg19) and murine genome (mm9) are already included in this package: to use one of them just set the argument genomic_annotation_file to either "hg19" or "mm9".
If for any reason the user is performing an analysis on another genome, this file needs to be changed respecting the UCSC Genome Browser format, meaning the input file headers should include:
name2, chrom, strand, min_txStart, max_txEnd, minmax_TxLen, average_TxLen, name, min_cdsStart, max_cdsEnd, minmax_CdsLen, average_CdsLen
A list with results and optionally missing genes info
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
cis_overtime <- CIS_grubbs_overtime(aggreg)
cis_overtime
Traces a volcano plot for IS frequency and CIS results.
CIS_volcano_plot( x, onco_db_file = "proto_oncogenes", tumor_suppressors_db_file = "tumor_suppressors", species = "human", known_onco = known_clinical_oncogenes(), suspicious_genes = clinical_relevant_suspicious_genes(), significance_threshold = 0.05, annotation_threshold_ontots = 0.1, highlight_genes = NULL, title_prefix = NULL, return_df = FALSE )
x |
Either a simple integration matrix or a data frame resulting
from the call to CIS_grubbs with |
onco_db_file |
Uniprot file for proto-oncogenes (see details). If different from default, should be supplied as a path to a file. |
tumor_suppressors_db_file |
Uniprot file for tumor-suppressor genes. If different from default, should be supplied as a path to a file. |
species |
One between |
known_onco |
Data frame with known oncogenes. See details. |
suspicious_genes |
Data frame with clinically relevant suspicious genes. See details. |
significance_threshold |
The significance threshold |
annotation_threshold_ontots |
Value above which genes are annotated with colorful labels |
highlight_genes |
Either |
title_prefix |
A string or character vector to be displayed
in the title - usually the
project name and other characterizing info. If a vector is supplied,
it is concatenated in a single string via |
return_df |
Return the data frame used to generate the plot? This can be useful if the user wants to manually modify the plot with ggplot2. If TRUE the function returns a list containing both the plot and the data frame. |
Users can supply as x either a simple integration matrix or a data frame resulting from the call to CIS_grubbs. In the first case an internal call to the function CIS_grubbs() is performed.
The proto-oncogene and tumor-suppressor files are included in the package for user convenience and are simply UniProt files with gene annotations for human and mouse. For more details on how these files were generated use the help ?tumor_suppressors, ?proto_oncogenes
The default values for known_onco are included in this package and can be accessed with:
known_clinical_oncogenes()
If the user wants to change this parameter the input data frame must preserve the column structure. The same goes for the suspicious_genes parameter (DOIReference column is optional):
clinical_relevant_suspicious_genes()
A plot or a list containing a plot and a data frame
The function will explicitly check for the presence of these tags:
gene_symbol
Other Plotting functions: HSC_population_plot(), circos_genomic_density(), fisher_scatterplot(), integration_alluvial_plot(), sharing_heatmap(), sharing_venn(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
cis_plot <- CIS_volcano_plot(integration_matrices,
    title_prefix = "PJ01"
)
cis_plot
Clinically relevant suspicious genes (for mouse and human).
clinical_relevant_suspicious_genes()
A data frame
Other Plotting function helpers:
known_clinical_oncogenes()
clinical_relevant_suspicious_genes()
Takes a list of integration matrices referring to different quantification types and merges them into a single data frame with multiple value columns, each renamed according to their quantification type of reference.
comparison_matrix( x, fragmentEstimate = "fragmentEstimate", seqCount = "seqCount", barcodeCount = "barcodeCount", cellCount = "cellCount", ShsCount = "ShsCount", value_col_name = "Value" )
x |
A named list of integration matrices, ideally obtained via
import_parallel_Vispa2Matrices. Names must be
quantification types in |
fragmentEstimate |
The name of the output column for fragment estimate values |
seqCount |
The name of the output column for sequence count values |
barcodeCount |
The name of the output column for barcode count values |
cellCount |
The name of the output column for cell count values |
ShsCount |
The name of the output column for Shs count values |
value_col_name |
Name of the column containing the corresponding values in the single matrices |
A single data frame
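Conceptually, the merge renames each matrix's value column after its quantification type and joins the matrices on the integration site and sample columns. The base R sketch below illustrates that idea on toy data. The use of `merge()` with a full join is an assumption made for illustration, not a description of the package internals.

```r
# Two toy single-quantification matrices sharing the same layout
sc <- data.frame(
  chr = c("1", "2"), integration_locus = c(45324, 52423),
  strand = c("+", "-"), CompleteAmplificationID = c("ID1", "ID1"),
  Value = c(543, 42), stringsAsFactors = FALSE
)
fe <- data.frame(
  chr = c("1", "2"), integration_locus = c(45324, 52423),
  strand = c("+", "-"), CompleteAmplificationID = c("ID1", "ID1"),
  Value = c(56.76, 78.32), stringsAsFactors = FALSE
)

# Rename the generic value column after each quantification type
names(sc)[names(sc) == "Value"] <- "seqCount"
names(fe)[names(fe) == "Value"] <- "fragmentEstimate"

# Join on the integration site fields plus the sample identifier
merged <- merge(
  fe, sc,
  by = c("chr", "integration_locus", "strand", "CompleteAmplificationID"),
  all = TRUE
)
merged
```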
Other Utilities: as_sparse_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
# Sequence count matrix
sc <- tibble::tribble(
    ~chr, ~integration_locus, ~strand, ~CompleteAmplificationID, ~Value,
    "1", 45324, "+", "ID1", 543,
    "2", 52423, "-", "ID1", 42,
    "6", 54623, "-", "ID2", 67,
    "X", 12314, "+", "ID3", 8
)
# Fragment estimate matrix
fe <- tibble::tribble(
    ~chr, ~integration_locus, ~strand, ~CompleteAmplificationID, ~Value,
    "1", 45324, "+", "ID1", 56.76,
    "2", 52423, "-", "ID1", 78.32,
    "6", 54623, "-", "ID2", 123.45,
    "X", 12314, "+", "ID3", 5.34
)
comparison_matrix(list(
    fragmentEstimate = fe,
    seqCount = sc
))
Abundance is obtained for every integration event by calculating the ratio between the single value and the total value for the given group.
compute_abundance( x, columns = c("fragmentEstimate_sum"), percentage = TRUE, key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), keep_totals = FALSE )
x |
An integration matrix - aka a data frame that includes
the |
columns |
A character vector of column names to process, must be numeric or integer columns |
percentage |
Add abundance as percentage? |
key |
The key to group by when calculating totals |
keep_totals |
A value between |
Abundance will be computed upon the user selected columns in the columns parameter. For each column a corresponding relative abundance column (and optionally a percentage abundance column) will be produced.
Either a single data frame with computed abundance values or a list of 2 data frames (abundance_df, quant_totals)
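The ratio described above (single value divided by the group total) can be reproduced by hand on a toy table. This base R sketch uses made-up values, and the output column names `abundance` and `abundance_perc` are illustrative, not the names the function produces.

```r
# Toy matrix with one quantification column and the sample ID as grouping key
im <- data.frame(
  chr = c("1", "2", "6", "X"),
  integration_locus = c(45324, 52423, 54623, 12314),
  strand = c("+", "-", "-", "+"),
  CompleteAmplificationID = c("ID1", "ID1", "ID2", "ID2"),
  fragmentEstimate = c(30, 10, 5, 15),
  stringsAsFactors = FALSE
)

# Total of the quantification within each group, repeated on every row
group_total <- ave(im$fragmentEstimate, im$CompleteAmplificationID, FUN = sum)

# Relative abundance and its percentage counterpart
im$abundance <- im$fragmentEstimate / group_total
im$abundance_perc <- im$abundance * 100
im
```

Within each group the relative abundances sum to 1, which is a quick sanity check on any abundance table.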
The function will explicitly check for the presence of these tags:
All columns declared in mandatory_IS_vars()
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
abund <- compute_abundance(
    x = integration_matrices,
    columns = "fragmentEstimate",
    key = "CompleteAmplificationID"
)
head(abund)
This function scans the input integration matrix to detect integration sites that are too "near" to each other and merges them into single integration sites, adjusting their values if needed.
compute_near_integrations( x, threshold = 4, is_identity_tags = c("chromosome", "is_strand"), keep_criteria = c("max_value", "keep_first"), value_columns = c("seqCount", "fragmentEstimate"), max_value_column = "seqCount", sample_id_column = pcr_id_column(), additional_agg_lambda = list(.default = default_rec_agg_lambdas()), max_workers = 4, map_as_file = TRUE, file_path = default_report_path(), strand_specific = lifecycle::deprecated() )
x |
An integration matrix |
threshold |
A single integer that represents an absolute
number of bases for which two integrations are considered distinct.
If the threshold is set to 3 it means, provided fields |
is_identity_tags |
Character vector of tags that identify the
integration event as distinct (except for |
keep_criteria |
While scanning, which integration should be kept? The 2 possible choices for this parameter are:
|
value_columns |
Character vector, contains the names of the numeric experimental columns |
max_value_column |
The column that has to be considered for searching the maximum value |
sample_id_column |
The name of the column containing the sample identifier |
additional_agg_lambda |
A named list containing aggregating functions for additional columns. See details. |
max_workers |
Maximum parallel workers allowed |
map_as_file |
Produce recalibration map as a .tsv file? |
file_path |
String representing the path where the file will be
saved. Must be a folder. Relevant only if |
strand_specific |
An integration event is uniquely identified by all fields specified in the mandatory_IS_vars() look-up table. It is possible to find IS that are formally distinct (different combination of values in the fields), but that should not be considered distinct in practice, since they represent the same integration event - this may be due to artefacts at the putative locus of the IS in the merging of multiple sequencing libraries.
We say that an integration event IS1 is near to another integration event IS2 if the absolute difference of their loci is strictly lower than the set threshold.
There is also another aspect to be considered. Since the algorithm is based on a sliding window mechanism, on which groups of IS should we set and slide the window?
By default, we have 3 fields in the mandatory_IS_vars(): chr, integration_locus, strand, and we assume that all the fields contribute to the identity of the IS. This means that IS1 and IS2 can be compared only if they have the same chromosome and the same strand. However, if we would like to exclude the strand of the integration from our considerations then IS1 and IS2 can be selected from all the events that fall on the same chromosome. A practical example:
IS1 = (chr = "1", strand = "+", integration_locus = 14568)
IS2 = (chr = "1", strand = "-", integration_locus = 14567)
If is_identity_tags = c("chromosome", "is_strand") IS1 and IS2 are considered distinct because they differ in strand, therefore no correction will be applied to loci of either of the 2.
If is_identity_tags = c("chromosome") then IS1 and IS2 are considered near, because the strand is irrelevant, hence one of the 2 IS will change locus.
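The practical example above can be replayed with a small base R sketch. The helper `are_near` is hypothetical and written only for this illustration. It compares two IS directly rather than sliding a window, and it takes column names (chr, strand) where the function takes tags (chromosome, is_strand).

```r
is1 <- list(chr = "1", strand = "+", integration_locus = 14568)
is2 <- list(chr = "1", strand = "-", integration_locus = 14567)

# Hypothetical helper, not part of the package: two IS are candidates for
# merging if they agree on every identity field and their loci differ by
# strictly less than the threshold
are_near <- function(a, b, identity_fields, threshold = 4) {
  same_identity <- all(vapply(
    identity_fields,
    function(f) identical(a[[f]], b[[f]]),
    logical(1)
  ))
  same_identity && abs(a$integration_locus - b$integration_locus) < threshold
}

# Strand is part of the identity: the two IS are kept distinct
are_near(is1, is2, identity_fields = c("chr", "strand")) # FALSE

# Strand is ignored: the loci differ by 1, which is below the threshold of 4
are_near(is1, is2, identity_fields = "chr") # TRUE
```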
IS that fall in the same interval are evaluated according to the criterion selected - if recalibration is necessary, rows with the same sample ID are aggregated in a single row with a quantification value that is the sum of all the merged rows.
If the input integration matrix contains annotation columns, that is, additional columns that are not
part of the mandatory IS vars (see mandatory_IS_vars())
part of the annotation IS vars (see annotation_IS_vars())
the sample identifier column
the quantification column
it is possible to specify how they should be aggregated. Defaults are provided for each column type (character, integer, numeric...), but custom functions can be specified as a named list, where names are column names in x and values are functions to be applied.
NOTE: functions must be purrr-style lambdas and they must perform some kind of aggregating operation, i.e. they must take a vector as input and return a single value. The type of the output should match the type of the target column. If you specify custom lambdas, provide defaults in the special element .defaults.
Example:
list(
    numeric_col = ~ sum(.x),
    char_col = ~ paste0(.x, collapse = ", "),
    .defaults = default_rec_agg_lambdas()
)
An integration matrix with the same number of rows or fewer
The function will explicitly check for the presence of these tags:
chromosome
locus
is_strand
gene_symbol
We recommend using this function in combination with comparison_matrix to automatically perform re-calibration on all quantification matrices.
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), default_meta_agg(), outlier_filter(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("integration_matrices", package = "ISAnalytics")
rec <- compute_near_integrations(
    x = integration_matrices,
    map_as_file = FALSE
)
head(rec)
Given an input integration matrix that can be grouped over time, this function adds integrations in groups assuming that if an integration is observed at time point "t" then it is also observed in time point "t+1".
cumulative_is( x, key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), timepoint_col = "TimePoint", include_tp_zero = FALSE, counts = TRUE, keep_og_is = FALSE, expand = TRUE )
x |
An integration matrix, ideally aggregated via
|
key |
The aggregation key used |
timepoint_col |
The name of the time point column |
include_tp_zero |
Should time point 0 be included? |
counts |
Add cumulative counts? Logical |
keep_og_is |
Keep original set of integrations as a separate column? |
expand |
If |
A data frame
The function will explicitly check for the presence of these tags:
All columns declared in mandatory_IS_vars()
Checks if the matrix is annotated by assessing presence of annotation_IS_vars()
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
cumulated_is <- cumulative_is(aggreg)
cumulated_is
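The assumption behind cumulative_is(), that an integration observed at time point "t" is also counted at every later time point, amounts to a running union over ordered time points. The base R sketch below shows that idea on toy data. It is not the package implementation, and the integration identifiers and the obs data frame are invented for the example.

```r
# Toy observations: which integration (hypothetical IDs) was seen at which time point
obs <- data.frame(
  TimePoint = c(30, 30, 60, 90),
  is_id = c("chr1_100_+", "chr2_500_-", "chr3_42_+", "chr1_100_+"),
  stringsAsFactors = FALSE
)

# Integrations observed at each time point, ordered by time
by_tp <- split(obs$is_id, obs$TimePoint)
by_tp <- by_tp[order(as.numeric(names(by_tp)))]

# Running union: everything seen at or before each time point
cumulative <- Reduce(union, by_tp, accumulate = TRUE)
names(cumulative) <- names(by_tp)

# Cumulative number of distinct integrations per time point
cum_counts <- vapply(cumulative, length, integer(1))
print(cum_counts)
```

The counts can only stay equal or grow from one time point to the next, which is what the cumulative counts requested with counts = TRUE are expected to reflect.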
Possible choices for the dates_format parameter in import_association_file, import_parallel_vispa2Matrices_interactive and import_parallel_vispa2Matrices_auto.
All options correspond to lubridate functions; see the dedicated package documentation for details.
date_formats()
A character vector
import_association_file, import_parallel_Vispa2Matrices_auto
Other Import functions helpers: annotation_issues(), default_af_transform(), default_iss_file_prefixes(), matching_options(), quantification_types()
date_formats()
A list of default transformations to apply to the association file columns
after importing it via import_association_file()
default_af_transform(convert_tp)
convert_tp |
The value of the argument |
A named list of lambdas
Other Import functions helpers: annotation_issues(), date_formats(), default_iss_file_prefixes(), matching_options(), quantification_types()
default_af_transform(TRUE)
Note that each element is a regular expression.
default_iss_file_prefixes()
A character vector of regexes
Other Import functions helpers: annotation_issues(), date_formats(), default_af_transform(), matching_options(), quantification_types()
default_iss_file_prefixes()
Default column-function specifications for aggregate_metadata
default_meta_agg()
This data frame contains four columns:
Column: holds the name of the column in the association file that should be processed
Function: contains either the name of a function (e.g. mean) or a purrr-style lambda (e.g. ~ mean(.x, na.rm = TRUE)). This function will be applied to the corresponding column specified in Column
Args: optional additional arguments to pass to the corresponding function. This is relevant ONLY if the corresponding Function is a simple function and not a purrr-style lambda.
Output_colname: a glue specification that will be used to determine a unique output column name. See glue for more details.
A data frame
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), outlier_filter(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
default_meta_agg()
Defaults for column aggregations in compute_near_integrations().
default_rec_agg_lambdas()
A named list of lambdas
default_rec_agg_lambdas()
Default folder for saving ISAnalytics reports. Supplied as default argument for several functions.
default_report_path()
A path
default_report_path()
A set of pre-defined functions for sample_statistics.
default_stats()
A named list of functions/purrr-style lambdas
default_stats()
This is a simple wrapper around functions from the package progressr. To customize the appearance of the progress bar, please refer to the progressr documentation.
enable_progress_bars()
NULL
Other Utilities: as_sparse_matrix(), comparison_matrix(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
enable_progress_bars()
progressr::handlers(global = FALSE) # Deactivate
This function allows exporting the currently set dynamic
vars in json format so they can be quickly imported later. Dynamic
variables need to be properly set via the setter functions before calling
the function. For more details, refer to the dedicated vignette
vignette("workflow_start", package="ISAnalytics").
export_ISA_settings(folder, setting_profile_name)
folder |
The path to the folder where the file should be saved. If the folder doesn't exist, it gets created automatically |
setting_profile_name |
A name for the settings profile |
NULL
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
tmp_folder <- tempdir()
export_ISA_settings(tmp_folder, "DEFAULT")
Plots results of Fisher's exact test on gene frequency obtained via gene_frequency_fisher() as a scatterplot.
fisher_scatterplot( fisher_df, p_value_col = "Fisher_p_value_fdr", annot_threshold = 0.05, annot_color = "red", gene_sym_col = "GeneName", do_not_highlight = NULL, keep_not_highlighted = TRUE )
fisher_df |
Test results obtained via |
p_value_col |
Name of the column containing the p-value to consider |
annot_threshold |
Annotate with a different color if a point is below the significance threshold. Single numerical value. |
annot_color |
The color in which points below the threshold should be annotated |
gene_sym_col |
The name of the column containing the gene symbol |
do_not_highlight |
Either |
keep_not_highlighted |
If present, how should not highlighted genes
be treated? If set to |
In some cases, users might want to avoid highlighting certain genes
even if their p-value is below the threshold. To do so, use the
argument do_not_highlight: character vectors are appropriate for specific
genes that are to be excluded, expressions or lambdas allow a finer control.
For example we can supply:
expr <- rlang::expr(!stringr::str_starts(GeneName, "MIR") & average_TxLen_1 >= 300)
With this expression, genes that have a p-value < threshold and start with
"MIR" or have an average_TxLen_1 lower than 300 are excluded from the
highlighted points.
NOTE: keep in mind that expressions are evaluated inside a dplyr::filter context.
Similarly, lambdas are passed to the filtering function but only operate on the column containing the gene symbol.
lambda <- ~ stringr::str_starts(.x, "MIR")
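To make the effect of the example expression concrete, the base R sketch below evaluates the same condition on a toy table. It assumes, as the text implies, that rows satisfying the expression stay highlighted and the remaining significant rows are excluded; how fisher_scatterplot() applies the filter internally is not shown here. The data values are invented and startsWith() stands in for stringr::str_starts().

```r
# Toy Fisher results (hypothetical values; column names follow the text above)
fisher_df <- data.frame(
  GeneName = c("MIR21", "LMO2", "SHORT1", "MECOM"),
  average_TxLen_1 = c(500, 1200, 150, 900),
  Fisher_p_value_fdr = c(0.01, 0.02, 0.03, 0.20),
  stringsAsFactors = FALSE
)
annot_threshold <- 0.05

below_threshold <- fisher_df$Fisher_p_value_fdr < annot_threshold

# Base R equivalent of the expression:
# !stringr::str_starts(GeneName, "MIR") & average_TxLen_1 >= 300
passes_expr <- !startsWith(fisher_df$GeneName, "MIR") &
  fisher_df$average_TxLen_1 >= 300

highlighted <- fisher_df$GeneName[below_threshold & passes_expr]
excluded <- fisher_df$GeneName[below_threshold & !passes_expr]
print(highlighted)
print(excluded)
```

In this toy table one gene is excluded because its name starts with "MIR" and another because its average_TxLen_1 is below 300, matching the "or" in the description.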
A plot
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), circos_genomic_density(), integration_alluvial_plot(), sharing_heatmap(), sharing_venn(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
cis <- CIS_grubbs(aggreg, by = "SubjectID")
fisher <- gene_frequency_fisher(cis$cis$PT001, cis$cis$PT002,
    min_is_per_gene = 2
)
fisher_scatterplot(fisher)
Given 2 data frames with calculations for CIS, obtained via CIS_grubbs(), computes Fisher's exact test.
Results can be plotted via fisher_scatterplot().
gene_frequency_fisher( cis_x, cis_y, min_is_per_gene = 3, gene_set_method = c("intersection", "union"), onco_db_file = "proto_oncogenes", tumor_suppressors_db_file = "tumor_suppressors", species = "human", known_onco = known_clinical_oncogenes(), suspicious_genes = clinical_relevant_suspicious_genes(), significance_threshold = 0.05, remove_unbalanced_0 = TRUE )
cis_x |
A data frame obtained via |
cis_y |
A data frame obtained via |
min_is_per_gene |
Used for pre-filtering purposes. Genes with a number of distinct integrations less than this number will be filtered out prior to calculations. Single numeric or integer. |
gene_set_method |
One between "intersection" and "union". When merging
the 2 data frames, |
onco_db_file |
Uniprot file for proto-oncogenes (see details). If different from default, should be supplied as a path to a file. |
tumor_suppressors_db_file |
Uniprot file for tumor-suppressor genes. If different from default, should be supplied as a path to a file. |
species |
One between |
known_onco |
Data frame with known oncogenes. See details. |
suspicious_genes |
Data frame with clinically relevant suspicious genes. See details. |
significance_threshold |
Significance threshold for the Fisher's test p-value |
remove_unbalanced_0 |
Remove from the final output those pairs in which there are no IS for one group or the other and the number of IS of the non-missing group is less than the mean number of IS for that group |
These files are included in the package for user convenience and are
simply UniProt files with gene annotations for human and mouse.
For more details on how these files were generated, see the help pages
?tumor_suppressors, ?proto_oncogenes
The default values are included in this package and can be accessed with:
known_clinical_oncogenes()
If the user wants to change this parameter the input data frame must
preserve the column structure. The same goes for the suspicious_genes
parameter (DOIReference column is optional):
clinical_relevant_suspicious_genes()
A data frame
The function will explicitly check for the presence of these tags:
gene_symbol
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), cumulative_is(), is_sharing(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
cis <- CIS_grubbs(aggreg, by = "SubjectID")
fisher <- gene_frequency_fisher(cis$cis$PT001, cis$cis$PT002,
    min_is_per_gene = 2
)
fisher
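For readers unfamiliar with Fisher's exact test itself, the sketch below runs base R's fisher.test() on a single 2x2 table for one gene. How gene_frequency_fisher() actually builds its contingency tables is not described in this section, so the table layout (integrations inside the gene versus elsewhere, per group) and all counts are assumptions made only for illustration.

```r
# Hypothetical counts for one gene in two groups (not package output)
is_in_gene <- c(group_x = 12, group_y = 3)
is_elsewhere <- c(group_x = 88, group_y = 97)

# 2x2 contingency table: rows = in gene / elsewhere, columns = groups
contingency <- rbind(in_gene = is_in_gene, elsewhere = is_elsewhere)

# Fisher's exact test from the stats package (attached by default)
test <- fisher.test(contingency)

# Compare the p-value against a significance threshold
significance_threshold <- 0.05
is_significant <- test$p.value < significance_threshold
print(test$p.value)
print(is_significant)
```

The comparison against significance_threshold mirrors the role of the argument with the same name, without claiming that the package applies it in exactly this way.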
Produces a blank association file to start using both VISPA2 and ISAnalytics
generate_blank_association_file(path)
path |
The path on disk where the file should be written - must be a file |
NULL
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
temp <- tempfile()
generate_blank_association_file(temp)
The function produces a folder structure in the file system at the provided path that respects VISPA2 standards, with package-included data.
generate_default_folder_structure( type = "correct", dir = tempdir(), af = "default", matrices = "default" )
type |
One value between |
dir |
Path to the folder in which the structure will be produced |
af |
Either |
matrices |
Either |
A named list containing the path to the association file and the path to the top level folder(s) of the structure
The function will explicitly check for the presence of these tags:
project_id
tag_seq
vispa_concatenate
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
fs_path <- generate_default_folder_structure(type = "correct")
fs_path
The function selects the appropriate columns and prepares a file for launching the VISPA2 pipeline for each project/pool pair specified.
generate_Vispa2_launch_AF(association_file, project, pool, path)
association_file |
The imported association file (via
|
project |
A vector of characters containing project names |
pool |
A vector of characters containing pool names |
path |
A single string representing the path to the folder where files should be written. If the folder doesn't exist it will be created. |
Note: the function is vectorized, meaning you can specify more than one project and more than one pool as vectors of characters, but you must ensure that:
Both project and pool vectors have the same length
You correctly type names in corresponding positions, for example c("PJ01", "PJ01") - c("POOL01", "POOL02"). If you type a pool in a position where it doesn't match the corresponding project, no file will be produced, since that pool doesn't exist in that project.
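The positional pairing of project and pool can be checked before calling the function. The base R sketch below pairs the two vectors by position and counts how many rows of a toy association file match each pair; a count of 0 flags a pair for which no file would be produced. The data frame and its column names (ProjectID, PoolID) are hypothetical stand-ins for the columns tagged project_id and pool_id.

```r
# Toy association file (hypothetical content and column names)
af <- data.frame(
  ProjectID = c("PJ01", "PJ01", "PJ02"),
  PoolID = c("POOL01", "POOL02", "POOL01"),
  stringsAsFactors = FALSE
)

project <- c("PJ01", "PJ01", "PJ02")
pool <- c("POOL01", "POOL02", "POOL03")

# Both vectors must have the same length: they are paired by position
stopifnot(length(project) == length(pool))

# For each project/pool pair, count matching rows in the association file
n_rows <- mapply(
  function(pj, pl) sum(af$ProjectID == pj & af$PoolID == pl),
  project, pool,
  USE.NAMES = FALSE
)
pairs <- data.frame(project, pool, n_rows, stringsAsFactors = FALSE)
print(pairs)
```

Here the third pair has no matching rows because POOL03 does not exist in PJ02, which is the situation the note above warns about.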
NULL
The function will explicitly check for the presence of these tags:
cell_marker
fusion_id
pcr_repl_id
pool_id
project_id
subject
tag_id
tissue
tp_days
vector_id
The names of the pools in the pool argument are checked against the column corresponding to the pool_id tag.
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices(), transform_columns()
temp <- tempdir()
data("association_file", package = "ISAnalytics")
generate_Vispa2_launch_AF(association_file, "PJ01", "POOL01", temp)
Plot of the estimated HSC population size for each patient.
HSC_population_plot( estimates, project_name, timepoints = "Consecutive", models = "Mth Chao (LB)" )
estimates |
The estimates data frame, obtained via
|
project_name |
The project name, will be included in the plot title |
timepoints |
Which time points to plot? One between "All", "Stable" and "Consecutive" |
models |
Name of the models to plot (as they appear in the column of the estimates) |
A plot
Other Plotting functions: CIS_volcano_plot(), circos_genomic_density(), fisher_scatterplot(), integration_alluvial_plot(), sharing_heatmap(), sharing_venn(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
aggreg_meta <- aggregate_metadata(
    association_file = association_file
)
estimate <- HSC_population_size_estimate(
    x = aggreg,
    metadata = aggreg_meta,
    stable_timepoints = c(90, 180, 360),
    cell_type = "Other"
)
p <- HSC_population_plot(estimate$est, "PJ01")
p
Hematopoietic stem cells population size estimate with capture-recapture models.
HSC_population_size_estimate( x, metadata, stable_timepoints = NULL, aggregation_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), blood_lineages = blood_lineages_default(), timepoint_column = "TimePoint", seqCount_column = "seqCount_sum", fragmentEstimate_column = "fragmentEstimate_sum", seqCount_threshold = 3, fragmentEstimate_threshold = 3, nIS_threshold = 5, cell_type = "MYELOID", tissue_type = "PB", max_workers = 4 )
x |
An aggregated integration matrix. See details. |
metadata |
An aggregated association file. See details. |
stable_timepoints |
A numeric vector or NULL if there are no stable time points. NOTE: the vector is NOT intended as a sequence min-max, every stable time point has to be specified individually |
aggregation_key |
A character vector indicating the key used for aggregating x and metadata. Note that x and metadata should always be aggregated with the same key. |
blood_lineages |
A data frame containing information on the blood
lineages. Users can supply their own, provided the columns |
timepoint_column |
What is the name of the time point column to use? Note that this column must be present in the key. |
seqCount_column |
What is the name of the column in x containing the values of sequence count quantification? |
fragmentEstimate_column |
What is the name of the column in x
containing the values of fragment estimate quantification? If fragment
estimate is not present in the matrix, param should be set to |
seqCount_threshold |
A single numeric value. After re-aggregating |
fragmentEstimate_threshold |
A single numeric value. Threshold value for fragment estimate, see details. |
nIS_threshold |
A single numeric value. If a group (row) in the metadata data frame has a count of distinct integration sites strictly greater than this number it will be kept, otherwise discarded. |
cell_type |
The cell types to include in the models. Note that the matching is case-insensitive. |
tissue_type |
The tissue types to include in the models. Note that the matching is case-insensitive. |
max_workers |
Maximum parallel workers allowed |
A data frame with the results of the estimates
Both x and metadata should be supplied to the function in aggregated
format (ideally through the use of aggregate_metadata and aggregate_values_by_key).
Note that the aggregation_key, i.e. the vector of column names used for
aggregation, must contain at least the columns associated with the tags
subject, cell_marker, tissue and a time point column
(the user can specify the name of the column in the argument timepoint_column).
Groups for the estimates are computed as a pair of cell type and tissue.
If the user wishes to compute estimates for more than one combination
of cell type and tissue, it is possible to specify them as character
vectors to the fields cell_type and tissue_type respectively,
noting that:
Vectors must have the same length or one of the 2 has to be of length 1
It is the responsibility of the user to check whether the combination exists in the dataset provided.
Example:
estimate <- HSC_population_size_estimate(
    x = aggreg,
    metadata = aggreg_meta,
    cell_type = c("MYELOID", "T", "B"),
    tissue_type = "PB"
)
# Evaluated groups will be:
# - MYELOID PB
# - T PB
# - B PB
Note that estimates are computed individually for each group.
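The pairing rule for cell_type and tissue_type (same length, or one of the two of length 1) behaves like ordinary recycling. The base R sketch below reproduces the groups listed in the example; it only illustrates the rule and does not show how the package builds its groups internally. The upper-casing is an assumption that reflects the case-insensitive matching mentioned in the arguments.

```r
cell_type <- c("MYELOID", "T", "B")
tissue_type <- "PB"

# Vectors must have the same length, or one of them must have length 1
lens <- c(length(cell_type), length(tissue_type))
stopifnot(lens[1] == lens[2] || any(lens == 1))

# data.frame() recycles the length-1 vector, giving one row per group
groups <- data.frame(
  cell_type = toupper(cell_type),
  tissue_type = toupper(tissue_type),
  stringsAsFactors = FALSE
)
print(groups)
```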
If stable_timepoints is a vector with length > 1, the function will look
for the first available stable time point and slice the data from that
time point onward. If NULL is supplied instead, it means there are no
stable time points available. Note that 0 time points are ALWAYS discarded.
Also, to be included in the analysis, a group must have at least 2
distinct non-zero time points.
NOTE: the vector passed has to contain all individual time points, not
just the minimum and maximum.
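The time point rules above can be walked through on a small vector. The base R sketch below is only an interpretation of the description: it drops time point 0, finds the first stable time point that is actually present, and slices from there. The observed values are invented, and what the package does when no stable time point is available in the data is not covered.

```r
# Time points observed for one group (hypothetical values, in days)
observed_tp <- c(0, 30, 60, 180, 360, 540)
stable_timepoints <- c(90, 180, 360)

# Time point 0 is always discarded
non_zero_tp <- observed_tp[observed_tp != 0]

# First stable time point that is actually present in the data
available_stable <- intersect(sort(stable_timepoints), non_zero_tp)
first_stable <- if (length(available_stable) > 0) available_stable[1] else NA

# Slice from the first available stable time point onward
stable_slice <- non_zero_tp[non_zero_tp >= first_stable]

# A group needs at least 2 distinct non-zero time points to be analysed
group_is_usable <- length(unique(non_zero_tp)) >= 2
print(first_stable)
print(stable_slice)
```

Because 90 is listed as stable but never observed, the slice starts at 180, which shows why every stable time point has to be listed individually rather than as a min-max range.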
If fragment estimate is present in the input matrix, the filtering logic
changes slightly: rows in the original matrix are kept if the sequence
count value is greater than or equal to the seqCount_threshold AND
the fragment estimate value is greater than or equal to the
fragmentEstimate_threshold IF PRESENT (non-zero value).
This means that for rows that miss fragment estimate, the filtering logic
will be applied only on sequence count. If the user wishes not to use
the combined filtering with fragment estimate, simply set
fragmentEstimate_threshold = 0.
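The combined filter can be written as a single logical condition. The base R sketch below is one reading of the rule stated above, not the package code: a row passes if its sequence count reaches the threshold and its fragment estimate either is missing or reaches its own threshold. Treating NA like 0 (that is, as "not present") is an assumption, and the toy values are invented.

```r
# Toy rows (hypothetical values); a fragment estimate of 0 or NA counts as missing
x <- data.frame(
  seqCount_sum = c(10, 2, 5, 8),
  fragmentEstimate_sum = c(4, 6, 0, 1)
)
seqCount_threshold <- 3
fragmentEstimate_threshold <- 3

fe <- x$fragmentEstimate_sum
fe_present <- !is.na(fe) & fe != 0

# Sequence count must pass; fragment estimate is checked only when present
keep <- x$seqCount_sum >= seqCount_threshold &
  (!fe_present | fe >= fragmentEstimate_threshold)
filtered <- x[keep, ]
print(filtered)
```

With fragmentEstimate_threshold set to 0 the second condition is always satisfied for non-negative values, so only the sequence count filter remains.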
The function will explicitly check for the presence of these tags:
subject
tissue
cell_marker
Other Analysis functions: CIS_grubbs(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
aggreg_meta <- aggregate_metadata(association_file = association_file)
estimate <- HSC_population_size_estimate(
    x = aggreg,
    metadata = aggreg_meta,
    fragmentEstimate_column = NULL,
    stable_timepoints = c(90, 180, 360),
    cell_type = "Other"
)
Imports the association file and optionally performs a check on the file system starting from the root to assess the alignment between the two.
import_association_file( path, root = NULL, dates_format = "ymd", separator = "\t", filter_for = NULL, import_iss = FALSE, convert_tp = TRUE, report_path = default_report_path(), transformations = default_af_transform(convert_tp), tp_padding = lifecycle::deprecated(), ... )
path |
The path on disk to the association file. |
root |
The path on disk of the root folder of VISPA2 output or |
dates_format |
A single string indicating how dates should be parsed.
Must be a value in: |
separator |
The column separator used in the file |
filter_for |
A named list where names represent column names that
must be filtered. For example: |
import_iss |
Import VISPA2 pool stats and merge them with the association file? Logical value |
convert_tp |
Should time points be converted into months and years? Logical value |
report_path |
The path where the report file should be saved.
Can be a folder or |
transformations |
Either |
tp_padding |
|
... |
Additional arguments to pass to
|
Lambdas provided in the transformations argument must be transformations, that is, functions that take a vector as input and return a vector of the same length.
If the transformation list contains column names that are not present in the data frame, they are simply ignored.
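The contract for transformations (vector in, vector of the same length out, unknown column names ignored) can be shown with a small base R helper. This is a sketch of the described behaviour, not the package's own code: plain functions stand in for purrr-style lambdas, and the data frame, its column names and apply_transformations() are all hypothetical.

```r
# Toy association file (hypothetical column names)
af <- data.frame(
  SubjectID = c(" pt001", "pt002 "),
  Tissue = c("pb", "bm"),
  stringsAsFactors = FALSE
)

# Each transformation takes a vector and returns a vector of the same length
transformations <- list(
  SubjectID = function(x) toupper(trimws(x)),
  Tissue = function(x) toupper(x),
  NotAColumn = function(x) x # not in the data frame: ignored
)

# Hypothetical helper mimicking the behaviour described above
apply_transformations <- function(df, funs) {
  for (col in intersect(names(funs), names(df))) {
    new_col <- funs[[col]](df[[col]])
    stopifnot(length(new_col) == nrow(df))
    df[[col]] <- new_col
  }
  df
}

af_clean <- apply_transformations(af, transformations)
print(af_clean)
```

A function such as sum() would violate the contract because it collapses the column to a single value; the length check in the helper is there to make that failure visible.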
If the root argument is set to NULL no file system alignment is
performed. This allows importing the basic file, but it won't be
possible to perform automated matrix and stats import.
For more details see the "How to use import functions" vignette:
vignette("workflow_start", package = "ISAnalytics")
The time point conversion is based on the following logic, given TPD
is the column containing the time point expressed in days and
TPM and TPY are respectively the time points expressed as months
and years:
If TPD is NA -> NA (for both months and years)
TPM = 0, TPY = 0 if and only if TPD = 0
For conversion in months:
TPM = ceiling(TPD/30) if TPD < 30 otherwise TPM = round(TPD/30)
For conversion in years:
TPY = ceiling(TPD/360)
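The conversion rules translate directly into a few lines of base R. The helper below, convert_timepoints(), is a hypothetical name used for this sketch and is not part of ISAnalytics; it simply applies the formulas as stated.

```r
# Hypothetical helper implementing the stated rules (TPD in days)
convert_timepoints <- function(tpd) {
  # Months: ceiling below 30 days, round otherwise; NA stays NA
  tpm <- ifelse(tpd < 30, ceiling(tpd / 30), round(tpd / 30))
  # Years: always ceiling; NA stays NA
  tpy <- ceiling(tpd / 360)
  data.frame(TPD = tpd, TPM = tpm, TPY = tpy)
}

tp <- convert_timepoints(c(NA, 0, 15, 30, 60, 360, 400))
print(tp)
```

Note that a time point of 0 days yields 0 for both months and years, while any positive value below 30 days maps to month 1 and year 1. R's round() rounds exact halves to the even digit, so round(75 / 30) gives 2 rather than 3.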
The data frame containing metadata
The function will explicitly check for the presence of these tags:
project_id
pool_id
tag_seq
subject
tissue
tp_days
cell_marker
pcr_replicate
vispa_concatenate
pcr_repl_id
proj_folder
The function will use all the available specifications contained in
association_file_columns(TRUE) to read and parse the file.
If the specifications contain columns with a type "date", the function
will parse the generic date with the format in the dates_format argument.
Other Import functions: import_Vispa2_stats(), import_parallel_Vispa2Matrices(), import_single_Vispa2Matrix()
fs_path <- generate_default_folder_structure(type = "correct")
af <- import_association_file(fs_path$af,
    root = fs_path$root,
    report_path = NULL
)
head(af)
The function allows the import of an existing dynamic vars
profile in json format. This is a quick and convenient way to set up
the workflow, alternative to specifying lookup tables manually through
the corresponding setter functions. For more details,
refer to the dedicated vignette
vignette("workflow_start", package="ISAnalytics").
import_ISA_settings(path)
path |
The path to the json file on disk |
NULL
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), separate_quant_matrices(), transform_columns()
tmp_folder <- tempdir()
export_ISA_settings(tmp_folder, "DEFAULT")
import_ISA_settings(fs::path(tmp_folder, "DEFAULT_ISAsettings.json"))
reset_dyn_vars_config()
The function offers a convenient way of importing multiple integration
matrices in an automated or semi-automated way.
For more details see the "How to use import functions" vignette:
vignette("workflow_start", package = "ISAnalytics")
import_parallel_Vispa2Matrices( association_file, quantification_type = c("seqCount", "fragmentEstimate"), matrix_type = c("annotated", "not_annotated"), workers = 2, multi_quant_matrix = TRUE, report_path = default_report_path(), patterns = NULL, matching_opt = matching_options(), mode = "AUTO", ... )
association_file |
Data frame imported via import_association_file (with file system alignment) |
quantification_type |
A vector of requested quantification_types. Possible choices are quantification_types |
matrix_type |
A single string representing the type of matrices to be imported. Can only be one in "annotated" or "not_annotated". |
workers |
A single integer representing the number of parallel workers to use for the import |
multi_quant_matrix |
If set to |
report_path |
The path where the report file should be saved.
Can be a folder or |
patterns |
A character vector of additional patterns to match on file
names. Please note that patterns must be regular expressions. Can be |
matching_opt |
A single value between matching_options |
mode |
Only |
... |
< |
Either a multi-quantification matrix or a list of integration matrices
The function will explicitly check for the presence of these tags:
project_id
vispa_concatenate
Other Import functions: import_Vispa2_stats(), import_association_file(), import_single_Vispa2Matrix()
fs_path <- generate_default_folder_structure(type = "correct")
af <- import_association_file(fs_path$af,
    root = fs_path$root,
    report_path = NULL
)
matrices <- import_parallel_Vispa2Matrices(af,
    c("seqCount", "fragmentEstimate"),
    mode = "AUTO", report_path = NULL
)
head(matrices)
This function reads and imports an integration matrix (ideally produced by VISPA2) and converts it to a tidy format.
import_single_Vispa2Matrix( path, separator = "\t", additional_cols = NULL, transformations = NULL, sample_names_to = pcr_id_column(), values_to = "Value", to_exclude = lifecycle::deprecated(), keep_excluded = lifecycle::deprecated() )
path |
The path to the file on disk |
separator |
The column delimiter used, defaults to |
additional_cols |
Either |
transformations |
Either |
sample_names_to |
Name of the output column holding the sample
identifier. Defaults to |
values_to |
Name of the output column holding the quantification
values. Defaults to |
to_exclude |
|
keep_excluded |
Additional columns are annotation columns present in the integration matrix to import that are not part of the mandatory IS vars (see mandatory_IS_vars()), not part of the annotation IS vars (see annotation_IS_vars()), and are neither the sample identifier column nor the quantification column.
When specified they tell the function how to treat those columns in the import phase, by providing a named character vector, where names correspond to the additional column names and values are a choice of the following:
"char" for character (strings)
"int" for integers
"logi" for logical values (TRUE / FALSE)
"numeric" for numeric values
"factor" for factors
"date" for generic date format - note that functions that need to read and parse files will try to guess the format and parsing may fail
One of the accepted date/datetime formats by lubridate, you can use ISAnalytics::date_formats() to view the accepted formats
"_" to drop the column
For more details see the "How to use import functions" vignette:
vignette("workflow_start", package = "ISAnalytics")
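To make the specification above concrete, here is a minimal base R sketch of a named character vector of type codes and of what applying it could look like. The helper `apply_col_spec()` and the column names are hypothetical and only illustrate how the spec reads; this is not how ISAnalytics performs the conversion internally, and date codes are left out for brevity.

```r
# Hypothetical spec: names are additional column names, values are type codes
additional_cols <- c(Batch = "char", Reads = "int", Passed = "logi", Notes = "_")

# Hypothetical helper (not part of ISAnalytics): converts or drops columns
apply_col_spec <- function(df, spec) {
  converters <- list(
    char = as.character,
    int = as.integer,
    logi = as.logical,
    numeric = as.numeric,
    factor = as.factor
  )
  for (col in intersect(names(spec), names(df))) {
    code <- spec[[col]]
    if (code == "_") {
      df[[col]] <- NULL # "_" drops the column
    } else {
      df[[col]] <- converters[[code]](df[[col]])
    }
  }
  df
}

raw <- data.frame(
  Batch = c(1, 2), Reads = c("10", "25"), Passed = c("TRUE", "FALSE"),
  Notes = c("a", "b"), stringsAsFactors = FALSE
)
typed <- apply_col_spec(raw, additional_cols)
str(typed)
```

In real use the same named vector would simply be passed as the additional_cols argument.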
Lambdas provided as input in the transformations argument must be transformations, i.e. functions that take a vector as input and return a vector of the same length as the input.
If the transformation list contains column names that are not present in the data frame, they are simply ignored.
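The two rules above (same-length output, unknown columns ignored) can be sketched in base R as follows. The helper `apply_transformations()` is hypothetical and written only for illustration; the package's own handling may differ in detail.

```r
# Hypothetical helper (not part of ISAnalytics) mimicking the documented rules
apply_transformations <- function(df, transformations) {
  for (col in names(transformations)) {
    if (!col %in% names(df)) next # columns not in the data frame are ignored
    out <- transformations[[col]](df[[col]])
    # A valid transformation returns a vector as long as its input
    stopifnot(length(out) == length(df[[col]]))
    df[[col]] <- out
  }
  df
}

mat <- data.frame(chr = c("1", "x"), Value = c(4, 9), stringsAsFactors = FALSE)
res <- apply_transformations(mat, list(
  chr = toupper,            # vector in, vector of the same length out
  Value = sqrt,
  NotThere = function(x) x  # no such column: skipped
))
res
```

A function such as `sum` would not qualify as a transformation here, because it collapses the column to a single value.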
A data frame object in tidy format
The function will explicitly check for the presence of these tags:
All columns declared in mandatory_IS_vars()
Other Import functions: import_Vispa2_stats(), import_association_file(), import_parallel_Vispa2Matrices()
fs_path <- generate_default_folder_structure(type = "correct")
matrix_path <- fs::path(
    fs_path$root,
    "PJ01",
    "quantification",
    "POOL01-1",
    "PJ01_POOL01-1_seqCount_matrix.no0.annotated.tsv.gz"
)
matrix <- import_single_Vispa2Matrix(matrix_path)
head(matrix)
Imports all the Vispa2 stats files for each pool provided the association file has been aligned with the file system (see import_association_file).
import_Vispa2_stats( association_file, file_prefixes = default_iss_file_prefixes(), join_with_af = TRUE, pool_col = "concatenatePoolIDSeqRun", report_path = default_report_path() )
association_file |
The file system aligned association file (contains columns with absolute paths to the 'iss' folder) |
file_prefixes |
A character vector with known file prefixes to match on file names. NOTE: the elements represent regular expressions. For defaults see default_iss_file_prefixes. |
join_with_af |
Logical, if |
pool_col |
A single string. What is the name of the pool column
used in the Vispa2 run? This will be used as a key to perform a join
operation with the stats files |
report_path |
The path where the report file should be saved.
Can be a folder or |
A data frame
The function will explicitly check for the presence of these tags:
project_id
tag_seq
vispa_concatenate
pcr_repl_id
Other Import functions: import_association_file(), import_parallel_Vispa2Matrices(), import_single_Vispa2Matrix()
fs_path <- generate_default_folder_structure(type = "correct")
af <- import_association_file(fs_path$af,
    root = fs_path$root,
    import_iss = FALSE,
    report_path = NULL
)
stats_files <- import_Vispa2_stats(af,
    join_with_af = FALSE,
    report_path = NULL
)
head(stats_files)
Given one or multiple tags, prints the associated description and functions where the tag is explicitly used.
inspect_tags(tags)
tags |
A character vector of tag names |
NULL
Other dynamic vars: mandatory_IS_vars(), pcr_id_column(), reset_mandatory_IS_vars(), set_mandatory_IS_vars(), set_matrix_file_suffixes()
inspect_tags(c("chromosome", "project_id", "x"))
Alluvial plots visualize the distribution of integration sites at different points in time within the same group. This functionality requires the suggested package ggalluvial.
integration_alluvial_plot( x, group = c("SubjectID", "CellMarker", "Tissue"), plot_x = "TimePoint", plot_y = "fragmentEstimate_sum_PercAbundance", alluvia = mandatory_IS_vars(), alluvia_plot_y_threshold = 1, top_abundant_tbl = TRUE, empty_space_color = "grey90", ... )
x |
A data frame. See details. |
group |
Character vector containing the column names that identify unique groups. |
plot_x |
Column name to plot on the x axis |
plot_y |
Column name to plot on the y axis |
alluvia |
Character vector of column names that uniquely identify alluvia |
alluvia_plot_y_threshold |
Numeric value. Everything below this threshold on y will be plotted in grey and aggregated. See details. |
top_abundant_tbl |
Logical. Produce the summary top abundant tables via top_abund_tableGrob? |
empty_space_color |
Color of the empty portion of the bars (IS below
the threshold). Can be either a string of known colors, an hex code or
|
... |
Additional arguments to pass on to top_abund_tableGrob |
The input data frame must contain all the columns specified in the arguments group, plot_x, plot_y and alluvia. The standard input for this function is the data frame obtained via the compute_abundance function.
The plotting threshold on the y-axis quantification serves to highlight only relevant information on the plot and to reduce computation time. The default value is 1, which acts on the default column plotted on the y axis, a percentage value. In natural language this translates roughly to "highlight with colors only those integrations (alluvia) that have an abundance value >= 1 % in at least 1 point in time". The remaining integrations are plotted as a single layer in the column, colored as specified by the argument empty_space_color.
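The threshold rule described above can be illustrated with a small base R sketch. The column names (`is_id`, `perc`) and the data are made up for the example and are not the ones produced by compute_abundance; the snippet only shows the "at least one time point at or above the threshold" logic, not the plotting itself.

```r
# Toy abundance table: one row per integration site (IS) and time point
abund <- data.frame(
  is_id = c("A", "A", "B", "B", "C", "C"),
  TimePoint = c(30, 60, 30, 60, 30, 60),
  perc = c(0.2, 1.5, 0.4, 0.9, 3.0, 0.1)
)
threshold <- 1

# An IS is highlighted if it reaches the threshold in at least one time point
max_perc <- tapply(abund$perc, abund$is_id, max)
highlighted <- names(max_perc)[max_perc >= threshold]
highlighted
```

Here "B" never reaches 1 %, so it would fall into the aggregated layer drawn with empty_space_color.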
The returned plots are ggplot2 objects and can therefore be further modified like any other ggplot2 object. For example, if the user decides to change the fill scale it is sufficient to do
plot +
    ggplot2::scale_fill_viridis_d(...) + # or any other discrete fill scale
    ggplot2::theme(...) # change theme options
NOTE: if you requested the computation of the top ten abundant tables and you want the colors to match you should re-compute them
Strata in each column are ordered first by time of appearance and secondly in decreasing order of abundance (value of y). It means, for example, that if the plot has 2 or more columns, in the second column the IS that appeared in the previous columns appear first, on top, followed by all other IS, ordered in decreasing order of abundance.
For each group a list with the associated plot and optionally the summary tableGrob
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), circos_genomic_density(), fisher_scatterplot(), sharing_heatmap(), sharing_venn(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
abund <- compute_abundance(x = aggreg)
alluvial_plots <- integration_alluvial_plot(abund,
    alluvia_plot_y_threshold = 0.5
)
ex_plot <- alluvial_plots[[1]]$plot +
    ggplot2::labs(
        title = "IS distribution over time",
        subtitle = "Patient 1, MNC BM",
        y = "Abundance (%)",
        x = "Time point (days after GT)"
    )
print(ex_plot)
The data was obtained manually by simulating real research data.
data("integration_matrices")
Data frame with 1689 rows and 8 columns
The chromosome number (as character)
Number of the base at which the viral insertion occurred
Strand of the integration
Symbol of the closest gene
Strand of the closest gene
Unique sample identifier
Value of the sequence count quantification
Value of the fragment estimate quantification
Computes the amount of integration sites shared between the groups identified in the input data.
is_sharing( ..., group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), group_keys = NULL, n_comp = 2, is_count = TRUE, relative_is_sharing = TRUE, minimal = TRUE, include_self_comp = FALSE, keep_genomic_coord = FALSE, table_for_venn = FALSE )
... |
One or more integration matrices |
group_key |
Character vector of column names which identify a single group. An associated group id will be derived by concatenating the values of these fields, separated by "_" |
group_keys |
A list of keys for asymmetric grouping.
If not NULL the argument |
n_comp |
Number of comparisons to compute. This argument is relevant only if provided a single data frame and a single key. |
is_count |
Logical, if |
relative_is_sharing |
Logical, if |
minimal |
Compute only combinations instead of all possible
permutations? If |
include_self_comp |
Include comparisons with the same group? |
keep_genomic_coord |
If |
table_for_venn |
Add column with truth tables for venn plots? |
An integration site is always identified by the combination of fields in mandatory_IS_vars(), thus these columns must be present in the input(s).
The function accepts multiple inputs for different scenarios, please refer to the vignette vignette("workflow_start", package = "ISAnalytics") for a more in-depth explanation.
The function outputs a single data frame containing all requested comparisons and optionally individual group counts, genomic coordinates of the shared integration sites and truth tables for plotting venn diagrams.
The sharing data obtained can be easily plotted in a heatmap via the function sharing_heatmap or via the function sharing_venn.
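As a rough illustration of what "sharing" means, the base R sketch below builds group ids by concatenating key columns with "_" and counts the integration sites two groups have in common. The data, the column names and the output names (`shared`, `on_g1`, `on_g2`) are assumptions made for the example; the actual output columns of is_sharing may be named and computed differently.

```r
# Toy aggregated matrix; column names are assumptions for this example
df <- data.frame(
  chr = c("1", "1", "2", "1", "2", "3"),
  integration_locus = c(100, 200, 300, 100, 300, 400),
  strand = c("+", "-", "+", "+", "+", "-"),
  SubjectID = "PT001",
  Tissue = c("BM", "BM", "BM", "PB", "PB", "PB"),
  stringsAsFactors = FALSE
)
is_vars <- c("chr", "integration_locus", "strand")
group_key <- c("SubjectID", "Tissue")

# Group id: values of the key columns concatenated with "_"
df$group_id <- do.call(paste, c(df[group_key], sep = "_"))
# IS id: combination of the identifying fields
df$is_id <- do.call(paste, c(df[is_vars], sep = "_"))

is_by_group <- split(df$is_id, df$group_id)
g1 <- unique(is_by_group[["PT001_BM"]])
g2 <- unique(is_by_group[["PT001_PB"]])
shared <- length(intersect(g1, g2))

# Absolute count plus the share relative to each group (in %)
c(shared = shared,
  on_g1 = 100 * shared / length(g1),
  on_g2 = 100 * shared / length(g2))
```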
A data frame
The function will explicitly check for the presence of these tags:
All columns declared in mandatory_IS_vars()
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), iss_source(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
sharing <- is_sharing(aggreg)
sharing
The function computes the sharing between a reference group of interest for each time point and a selection of groups of interest. In this way it is possible to observe the percentage of shared integration sites between the reference and each group and to identify at which time point a certain IS was observed for the first time.
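The idea of finding the time point at which an IS was first observed can be sketched in base R as below. The table and its column names are invented for the example, and this is only the underlying notion, not the computation iss_source performs.

```r
# Toy observations: which IS was seen at which time point
obs <- data.frame(
  is_id = c("A", "B", "A", "C", "B"),
  TimePoint = c(60, 30, 30, 90, 90)
)

# Earliest time point per IS
first_seen <- aggregate(TimePoint ~ is_id, data = obs, FUN = min)
first_seen
```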
iss_source( reference, selection, ref_group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), selection_group_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), timepoint_column = "TimePoint", by_subject = TRUE, subject_column = "SubjectID" )
reference |
A data frame containing one or more groups of reference.
Groups are identified by |
selection |
A data frame containing one or more groups of interest
to compare.
Groups are identified by |
ref_group_key |
Character vector of column names that identify a
unique group in the |
selection_group_key |
Character vector of column names that identify a
unique group in the |
timepoint_column |
Name of the column holding time point info |
by_subject |
Should calculations be performed for each subject separately? |
subject_column |
Name of the column holding subjects information.
Relevant only if |
A list of data frames or a data frame
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), sample_statistics(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
df1 <- aggreg |>
    dplyr::filter(.data$Tissue == "BM")
df2 <- aggreg |>
    dplyr::filter(.data$Tissue == "PB")
source <- iss_source(df1, df2)
source
ggplot2::ggplot(source$PT001, ggplot2::aes(
    x = as.factor(g2_TimePoint),
    y = sharing_perc, fill = g1
)) +
    ggplot2::geom_col() +
    ggplot2::labs(
        x = "Time point", y = "Shared IS % with MNC BM",
        title = "Source of is MNC BM vs MNC PB"
    )
Known clinical oncogenes (for mouse and human).
known_clinical_oncogenes()
A data frame
Other Plotting function helpers:
clinical_relevant_suspicious_genes()
known_clinical_oncogenes()
Fetches the look-up tables for different categories of dynamic vars. For more details, refer to the dedicated vignette vignette("workflow_start", package="ISAnalytics").
mandatory_IS_vars() returns the look-up table of variables that are used to uniquely identify integration events
annotation_IS_vars() returns the look-up table of variables that contain genomic annotations
association_file_columns() returns the look-up table of variables that contain information on how metadata is structured
iss_stats_specs() returns the look-up table of variables that contain information on the format of pool statistics files produced automatically by VISPA2
matrix_file_suffixes() returns the look-up table of variables that contain all default file names for each quantification type; it is used by automated import functions
mandatory_IS_vars(include_types = FALSE)
annotation_IS_vars(include_types = FALSE)
association_file_columns(include_types = FALSE)
iss_stats_specs(include_types = FALSE)
matrix_file_suffixes()
include_types |
If set to |
A character vector or a data frame
Other dynamic vars: inspect_tags(), pcr_id_column(), reset_mandatory_IS_vars(), set_mandatory_IS_vars(), set_matrix_file_suffixes()
# Names only
mandatory_IS_vars()
# Names and types
mandatory_IS_vars(TRUE)
# Names only
annotation_IS_vars()
# Names and types
annotation_IS_vars(TRUE)
# Names only
association_file_columns()
# Names and types
association_file_columns(TRUE)
# Names only
iss_stats_specs()
# Names and types
iss_stats_specs(TRUE)
# Names only
matrix_file_suffixes()
matching_opt parameter.
These are all the possible values for the matching_opt parameter in import_parallel_vispa2Matrices_auto.
matching_options()
The values "ANY", "ALL" and "OPTIONAL" represent how the patterns should be matched, more specifically:
ANY = look only for files that match AT LEAST one of the patterns specified
ALL = look only for files that match ALL of the patterns specified
OPTIONAL = look preferentially for files that match, in order, all patterns or any pattern and if no match is found return what is found (keep in mind that duplicates are discarded in automatic mode)
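The three options above can be sketched with plain regular expression matching in base R. The helper `match_files()` and the file names are hypothetical; the sketch follows the wording of the descriptions (including the fallback order for OPTIONAL) and is not the package's implementation, which also deals with duplicates.

```r
# Hypothetical helper (not part of ISAnalytics)
match_files <- function(files, patterns,
                        matching_opt = c("ANY", "ALL", "OPTIONAL")) {
  matching_opt <- match.arg(matching_opt)
  # One row per file, one column per pattern
  hits <- vapply(patterns, function(p) grepl(p, files), logical(length(files)))
  hits <- matrix(hits, nrow = length(files))
  any_hit <- rowSums(hits) > 0
  all_hit <- rowSums(hits) == length(patterns)
  switch(matching_opt,
    ANY = files[any_hit],
    ALL = files[all_hit],
    OPTIONAL = if (any(all_hit)) {
      files[all_hit]   # prefer files matching all patterns
    } else if (any(any_hit)) {
      files[any_hit]   # then files matching any pattern
    } else {
      files            # otherwise return what is found
    }
  )
}

files <- c(
  "PJ01_seqCount_matrix.no0.annotated.tsv.gz",
  "PJ01_seqCount_matrix.tsv.gz",
  "PJ01_fragmentEstimate_matrix.no0.annotated.tsv.gz"
)
match_files(files, c("seqCount", "no0"), "ALL")
match_files(files, c("seqCount", "no0"), "ANY")
match_files(files, "zzz", "OPTIONAL")
```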
A vector of characters for matching_opt
import_parallel_Vispa2Matrices_auto
Other Import functions helpers: annotation_issues(), date_formats(), default_af_transform(), default_iss_file_prefixes(), quantification_types()
opts <- matching_options()
Launch the shiny application NGSdataExplorer.
NGSdataExplorer()
Nothing
## Not run:
NGSdataExplorer()
## End(Not run)
Filter out outliers in metadata by using appropriate outlier tests.
outlier_filter( metadata, pcr_id_col = pcr_id_column(), outlier_test = c(outliers_by_pool_fragments), outlier_test_outputs = NULL, combination_logic = c("AND"), negate = FALSE, report_path = default_report_path(), ... )
metadata |
The metadata data frame |
pcr_id_col |
The name of the pcr identifier column |
outlier_test |
One or more outlier tests. Must be functions,
either from |
outlier_test_outputs |
|
combination_logic |
One or more logical operators ("AND", "OR", "XOR", "NAND", "NOR", "XNOR"). See details. |
negate |
If |
report_path |
The path where the report file should be saved.
Can be a folder or |
... |
Additional named arguments passed to |
The outlier filtering functions are structured in a modular fashion. There are 2 kinds of functions:
Outlier tests - Functions that perform some kind of calculation based on inputs and flag metadata
Outlier filter - A function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers
This function acts as the filter. It can either take one or more outlier tests as functions and call them through the argument outlier_test, or it can directly take outputs produced by individual tests in the argument outlier_test_outputs - if both are provided the second one has priority. The second method offers a bit more freedom, since single tests can be run independently and intermediate results saved and examined in more detail. If more than one test is to be performed, the argument combination_logic tells the function how to combine the flags: you can specify 1 logical operator or more than 1, provided it is compatible with the number of tests.
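A minimal sketch of how flags from several tests could be combined is shown below. The helper `combine_flags()` is hypothetical, and two points are assumptions rather than documented behavior: that operators are applied left to right, and that a single operator is reused between every pair of tests.

```r
# Hypothetical helper: combines logical flag vectors with the given operators
combine_flags <- function(flags, logic) {
  ops <- list(
    AND = function(a, b) a & b,
    OR = function(a, b) a | b,
    XOR = function(a, b) xor(a, b),
    NAND = function(a, b) !(a & b),
    NOR = function(a, b) !(a | b),
    XNOR = function(a, b) !xor(a, b)
  )
  # Assumption: one operator per adjacent pair, a single operator is recycled
  logic <- rep_len(logic, length(flags) - 1)
  result <- flags[[1]]
  for (i in seq_along(logic)) {
    result <- ops[[logic[i]]](result, flags[[i + 1]])
  }
  result
}

test_a <- c(TRUE, TRUE, FALSE)
test_b <- c(TRUE, FALSE, FALSE)
test_c <- c(FALSE, TRUE, TRUE)

combine_flags(list(test_a, test_b), "AND")
combine_flags(list(test_a, test_b, test_c), c("AND", "OR"))
```

With three tests and `c("AND", "OR")` the expression evaluated is `(test_a & test_b) | test_c`.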
You have the freedom to provide your own functions as outlier tests. For this purpose, the functions provided must respect these guidelines:
Must take as input the whole metadata df
Must return a df containing AT LEAST the pcr_id_col and a logical column "to_remove" that contains the flag
The pcr_id_col must contain all the values originally present in the metadata df
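A user-defined test following the guidelines above might look like the sketch below. The function name, the `RawReads` column, the cutoff and the use of "CompleteAmplificationID" as the pcr id column are all assumptions made for the example.

```r
# Hypothetical user-defined test: flags samples with too few reads.
# Column names and cutoff are illustrative assumptions.
low_reads_test <- function(metadata,
                           pcr_id_col = "CompleteAmplificationID",
                           reads_col = "RawReads",
                           cutoff = 100) {
  data.frame(
    # Every id present in the input metadata is returned
    setNames(list(metadata[[pcr_id_col]]), pcr_id_col),
    # Logical flag column with the required name
    to_remove = !is.na(metadata[[reads_col]]) & metadata[[reads_col]] < cutoff,
    stringsAsFactors = FALSE
  )
}

meta <- data.frame(
  CompleteAmplificationID = c("S1", "S2", "S3"),
  RawReads = c(50, 5000, NA),
  stringsAsFactors = FALSE
)
flags <- low_reads_test(meta)
flags
```

Missing values are deliberately not flagged here, so that the "to_remove" column never contains NA.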
A data frame of metadata with fewer rows than the input, or the same number of rows
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("association_file", package = "ISAnalytics")
filtered_af <- outlier_filter(association_file,
    key = "BARCODE_MUX",
    report_path = NULL
)
head(filtered_af)
Identify and flag outliers based on expected number of raw reads per pool.
outliers_by_pool_fragments( metadata, key = "BARCODE_MUX", outlier_p_value_threshold = 0.01, normality_test = FALSE, normality_p_value_threshold = 0.05, transform_log2 = TRUE, per_pool_test = TRUE, pool_col = "PoolID", min_samples_per_pool = 5, flag_logic = "AND", keep_calc_cols = TRUE, report_path = default_report_path() )
metadata |
The metadata data frame |
key |
A character vector of numeric column names |
outlier_p_value_threshold |
The p value threshold for a read to be considered an outlier |
normality_test |
Perform normality test? Normality is assessed for each column in the key using Shapiro-Wilk test and if the values do not follow a normal distribution, other calculations are skipped |
normality_p_value_threshold |
Normality threshold |
transform_log2 |
Perform a log2 transformation on values prior to the actual calculations? |
per_pool_test |
Perform the test for each pool? |
pool_col |
A character vector of the names of the columns that uniquely identify a pool |
min_samples_per_pool |
The minimum number of samples that a pool
needs to contain in order to be processed - relevant only if
|
flag_logic |
A character vector of logic operators to obtain a global flag formula - only relevant if the key is longer than one. All operators must be chosen between: AND, OR, XOR, NAND, NOR, XNOR |
keep_calc_cols |
Keep the calculation columns in the output data frame? |
report_path |
The path where the report file should be saved.
Can be a folder, a file or NULL if no report should be produced.
Defaults to |
The outlier filtering functions are structured in a modular fashion. There are 2 kinds of functions:
Outlier tests - Functions that perform some kind of calculation based on inputs and flag metadata
Outlier filter - A function that takes one or more outlier tests, combines all the flags with a given logic and filters out rows that are flagged as outliers
This function is an outlier test, and calculates for each column in the key:
The zscore of the values
The tstudent of the values
The associated p-value (tdist)
Optionally the test can be performed for each pool and a normality test can be run prior to the actual calculations. Samples are flagged if this condition is met:
tdist < outlier_p_value_threshold & zscore < 0
If the key contains more than one column an additional flag logic can be specified for combining the results.
Example: let's suppose the key contains the names of two columns, X and Y
key = c("X", "Y")
if we specify the argument flag_logic = "AND" then the reads will be flagged based on this global condition:
(tdist_X < outlier_p_value_threshold & zscore_X < 0) AND (tdist_Y < outlier_p_value_threshold & zscore_Y < 0)
The user can specify one or more logical operators that will be applied in sequence.
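The flagging condition in the example can be written out in base R as follows. The statistics are made-up numbers standing in for values the function would compute; only the condition itself is taken from the text above.

```r
# Precomputed per-column statistics (values invented for illustration)
stats <- data.frame(
  tdist_X = c(0.001, 0.200, 0.005),
  zscore_X = c(-2.9, 0.4, -2.5),
  tdist_Y = c(0.004, 0.300, 0.500),
  zscore_Y = c(-2.6, -0.1, 1.2)
)
outlier_p_value_threshold <- 0.01

# Per-column condition: small p-value AND value below the mean
flag_X <- stats$tdist_X < outlier_p_value_threshold & stats$zscore_X < 0
flag_Y <- stats$tdist_Y < outlier_p_value_threshold & stats$zscore_Y < 0

to_remove_and <- flag_X & flag_Y # corresponds to flag_logic = "AND"
to_remove_or <- flag_X | flag_Y  # corresponds to flag_logic = "OR"
data.frame(flag_X, flag_Y, to_remove_and, to_remove_or)
```

The third sample is an outlier on X only, so it is flagged under "OR" but not under "AND".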
A data frame of metadata with the column to_remove
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), purity_filter(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("association_file", package = "ISAnalytics")
flagged <- outliers_by_pool_fragments(association_file,
    report_path = NULL
)
head(flagged)
The function is a shortcut to retrieve the currently set pcr id column name from the association file column tags look-up table. This column is needed every time a joining operation with metadata needs to be performed
pcr_id_column()
The name of the column
Other dynamic vars: inspect_tags(), mandatory_IS_vars(), reset_mandatory_IS_vars(), set_mandatory_IS_vars(), set_matrix_file_suffixes()
pcr_id_column()
The file is simply the result of a search with the keywords "proto-oncogenes" and "tumor suppressor" for the target genomes on the UniProt database.
data("proto_oncogenes")
data("tumor_suppressors")
An object of class tbl_df (inherits from tbl, data.frame) with 569 rows and 13 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 523 rows and 13 columns.
tumor_suppressors: Data frame for tumor suppressor genes
Filter that targets possible contamination between cell lines based on a numeric quantification (likely abundance or sequence count).
purity_filter( x, lineages = blood_lineages_default(), aggregation_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"), group_key = c("CellMarker", "Tissue"), selected_groups = NULL, join_on = "CellMarker", min_value = 3, impurity_threshold = 10, by_timepoint = TRUE, timepoint_column = "TimePoint", value_column = "seqCount_sum" )
x |
An aggregated integration matrix, obtained via
|
lineages |
A data frame containing cell lineages information |
aggregation_key |
The key used for aggregating |
group_key |
A character vector of column names for re-aggregation.
Column names must be either in |
selected_groups |
Either NULL, a character vector or a data frame for group selection. See details. |
join_on |
Common columns to perform a join operation on |
min_value |
A minimum value to filter the input matrix. Integrations
with a value strictly lower than |
impurity_threshold |
The ratio threshold for impurity in groups |
by_timepoint |
Should filtering be applied on each time point? If
|
timepoint_column |
Column in |
value_column |
Column in |
The input matrix can be re-aggregated with the provided group_key argument. This key contains the names of the columns to group on (besides the columns holding genomic coordinates of the integration sites) and must be contained in at least one of the x or lineages data frames. If the key is not found in x alone, then a join operation with the lineages data frame is performed on the common column(s) join_on.
The user can specify which groups the filter logic should be applied to. For example: if we have
group_key = c("HematoLineage")
and we set
selected_groups = c("CD34", "Myeloid","Lymphoid")
it means that a single integration will be evaluated for the filter only for groups that have the values of "CD34", "Myeloid" and "Lymphoid" in the "HematoLineage" column. If the same integration is present in other groups it is kept as it is. selected_groups can be set to NULL if we want the logic to apply to every group present in the data frame; it can be set as a simple character vector, as in the example above, if the group key has length 1 (and there is no need to filter on time point). If the group key is longer than 1 then the filter is applied only on the first element of the key.
If a more refined selection on groups is needed, a data frame can be provided instead:
group_key = c("CellMarker", "Tissue")
selected_groups = tibble::tribble(
    ~ CellMarker, ~ Tissue,
    "CD34", "BM",
    "CD14", "BM",
    "CD14", "PB"
)
Columns in the data frame should be the same as the group key (plus, possibly, the time point column). In this example only those groups identified by the rows in the provided data frame are processed.
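How a data frame of selected groups narrows down the rows to process can be sketched in base R as below. The toy matrix and the `_` separator used to build a comparison key are assumptions for the example; the snippet only illustrates the selection step, not the purity logic applied afterwards.

```r
# Toy aggregated data (genomic coordinate columns omitted for brevity)
x <- data.frame(
  CellMarker = c("CD34", "CD14", "CD14", "CD19"),
  Tissue = c("BM", "BM", "PB", "PB"),
  seqCount_sum = c(10, 4, 7, 3),
  stringsAsFactors = FALSE
)
selected_groups <- data.frame(
  CellMarker = c("CD34", "CD14"),
  Tissue = c("BM", "PB"),
  stringsAsFactors = FALSE
)
group_key <- c("CellMarker", "Tissue")

# A row belongs to a selected group if its key values match a selection row
row_id <- do.call(paste, c(x[group_key], sep = "_"))
sel_id <- do.call(paste, c(selected_groups[group_key], sep = "_"))
in_selection <- row_id %in% sel_id

to_process <- x[in_selection, ]  # evaluated by the filter
kept_as_is <- x[!in_selection, ] # left untouched
to_process
```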
A data frame
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), outliers_by_pool_fragments(), realign_after_collisions(), remove_collisions(), threshold_filter()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
filtered_by_purity <- purity_filter(
    x = aggreg,
    value_column = "seqCount_sum"
)
head(filtered_by_purity)
quantification_type parameter.
These are all the possible values for the quantification_type parameter in import_parallel_vispa2Matrices_interactive and import_parallel_vispa2Matrices_auto.
quantification_types()
The possible values are:
fragmentEstimate
seqCount
barcodeCount
cellCount
ShsCount
A vector of characters for quantification types
import_parallel_Vispa2Matrices_interactive, import_parallel_Vispa2Matrices_auto
Other Import functions helpers: annotation_issues(), date_formats(), default_af_transform(), default_iss_file_prefixes(), matching_options()
quant_types <- quantification_types()
This function should be used to keep data consistent within the same analysis: if for some reason you removed the collisions by passing only the sequence count matrix to remove_collisions(), you should call this function afterwards, providing a list of other quantification matrices.
NOTE: if you provided a list of several quantification types to remove_collisions() before, there is no need to call this function.
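One plausible reading of "re-aligning" is that the other matrices keep only the (integration site, sample) pairs still present in the processed sequence count matrix. The base R sketch below illustrates that reading with toy data; the column names and the inner-join approach are assumptions, not a description of the function's internals.

```r
# Sequence counts after collision removal (toy data)
sc_matrix <- data.frame(
  chr = c("1", "2"), integration_locus = c(100, 300), strand = c("+", "+"),
  CompleteAmplificationID = c("S1", "S2"), Value = c(12, 40),
  stringsAsFactors = FALSE
)
# Fragment estimates, not yet processed (toy data)
fe_matrix <- data.frame(
  chr = c("1", "2", "2"), integration_locus = c(100, 300, 300),
  strand = c("+", "+", "+"),
  CompleteAmplificationID = c("S1", "S2", "S3"), Value = c(3.1, 8.7, 1.2),
  stringsAsFactors = FALSE
)
key <- c("chr", "integration_locus", "strand", "CompleteAmplificationID")

# Keep only the (IS, sample) pairs that survived in the sequence count matrix
realigned_fe <- merge(fe_matrix, sc_matrix[key], by = key)
realigned_fe
```

The row for sample "S3" is dropped because that pair no longer exists in the sequence count matrix.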
realign_after_collisions( sc_matrix, other_matrices, sample_column = pcr_id_column() )
sc_matrix |
The sequence count matrix already processed for collisions
via |
other_matrices |
A named list of matrices to re-align. Names in the list
must be quantification types ( |
sample_column |
The name of the column containing the sample identifier |
For more details on how to use collision removal functionality:
vignette("workflow_start", package = "ISAnalytics")
A named list with re-aligned matrices
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), outliers_by_pool_fragments(), purity_filter(), remove_collisions(), threshold_filter()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
separated <- separate_quant_matrices(
    integration_matrices
)
no_coll <- remove_collisions(
    x = separated$seqCount,
    association_file = association_file,
    quant_cols = c(seqCount = "Value"),
    report_path = NULL
)
realigned <- realign_after_collisions(
    sc_matrix = no_coll,
    other_matrices = list(fragmentEstimate = separated$fragmentEstimate)
)
realigned
Selection of column names from the association file to be considered for Vispa2 launch.
NOTE: the TagID column appears only once but needs to be repeated twice for generating the launch file. Use the appropriate function to generate the file automatically.
reduced_AF_columns()
A character vector
reduced_AF_columns()
Required columns for refGene file.
refGene_table_cols()
Character vector of column names
refGene_table_cols()
This file was obtained following these steps:
Download from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/ the refGene.sql, knownGene.sql, knownToRefSeq.sql, kgXref.sql tables
Import everything into MySQL
Generate views for annotation:
SELECT kg.`chrom`, min(kg.cdsStart) as CDS_minStart, max(kg.`cdsEnd`) as CDS_maxEnd, k2a.geneSymbol, kg.`strand` as GeneStrand, min(kg.txStart) as TSS_minStart, max(kg.txEnd) as TSS_maxStart, kg.proteinID as ProteinID, k2a.protAcc as ProteinAcc, k2a.spDisplayID
FROM `knownGene` AS kg JOIN kgXref AS k2a ON BINARY kg.name = k2a.kgID COLLATE latin1_bin
-- latin1_swedish_ci
-- WHERE k2a.spDisplayID IS NOT NULL and (k2a.`geneSymbol` LIKE 'Tcra%' or k2a.`geneSymbol` LIKE 'TCRA%')
WHERE (k2a.spDisplayID IS NOT NULL or k2a.spDisplayID NOT LIKE '') and k2a.`geneSymbol` LIKE 'Tcra%'
group by kg.`chrom`, k2a.geneSymbol
ORDER BY kg.chrom ASC , kg.txStart ASC
data("refGenes_hg19")
data("refGenes_mm9")
An object of class tbl_df (inherits from tbl, data.frame) with 27275 rows and 12 columns.
An object of class tbl_df (inherits from tbl, data.frame) with 24487 rows and 12 columns.
refGenes_mm9: Data frame for murine mm9 genome
A collision is an integration (aka a unique combination of the provided mandatory_IS_vars()) which is observed in more than one independent sample.
The function tries to decide to which independent sample an integration event should be assigned; if no decision can be taken, the integration is completely removed from the data frame.
For more details refer to the vignette "Collision removal functionality":
vignette("workflow_start", package = "ISAnalytics")
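As a rough illustration of the definition above, collisions can be detected by counting in how many independent samples each integration is observed. This is a base R sketch on made-up data, not the package's decision procedure; the column names are assumptions, and the step that assigns a collision to one sample is deliberately left out.

```r
# Illustrative sketch in base R, not the package implementation.
# Column names below are assumptions made for this example.
is_vars <- c("chr", "integration_locus", "strand")
indep_sample <- c("ProjectID", "SubjectID")

obs <- data.frame(
  chr = c("1", "1", "2", "2"),
  integration_locus = c(100L, 100L, 200L, 200L),
  strand = c("+", "+", "-", "-"),
  ProjectID = c("PJ01", "PJ01", "PJ01", "PJ01"),
  SubjectID = c("PT001", "PT002", "PT001", "PT001"),
  seqCount = c(500, 20, 40, 60)
)

# One identifier per integration and one per independent sample
is_id <- do.call(paste, c(obs[is_vars], sep = "_"))
sample_id <- do.call(paste, c(obs[indep_sample], sep = "_"))

# Number of distinct independent samples in which each integration is seen
n_samples <- tapply(sample_id, is_id, function(s) length(unique(s)))

# Integrations observed in more than one independent sample are collisions
collisions <- names(n_samples)[n_samples > 1]
collisions
```

Here the integration on chromosome 1 is shared by two subjects and is flagged, while the one on chromosome 2 appears twice in the same subject and is not.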
remove_collisions( x, association_file, independent_sample_id = c("ProjectID", "SubjectID"), date_col = "SequencingDate", reads_ratio = 10, quant_cols = c(seqCount = "seqCount", fragmentEstimate = "fragmentEstimate"), report_path = default_report_path(), max_workers = NULL )
x |
Either a multi-quantification matrix (recommended) or a named list of matrices (names must be quantification types) |
association_file |
The association file imported via
|
independent_sample_id |
A character vector of column names that identify independent samples |
date_col |
The date column that should be considered. |
reads_ratio |
A single numeric value that represents the ratio that has
to be considered when deciding between |
quant_cols |
A named character vector where names are
quantification types and
values are the names of the corresponding columns. The quantification
|
report_path |
The path where the report file should be saved.
Can be a folder or |
max_workers |
Maximum number of parallel workers to distribute the
workload. If |
Either a multi-quantification matrix or a list of data frames
The function will explicitly check for the presence of these tags:
project_id
pool_id
pcr_replicate
Other Data cleaning and pre-processing: aggregate_metadata(), aggregate_values_by_key(), compute_near_integrations(), default_meta_agg(), outlier_filter(), outliers_by_pool_fragments(), purity_filter(), realign_after_collisions(), threshold_filter()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
no_coll <- remove_collisions(
    x = integration_matrices,
    association_file = association_file,
    report_path = NULL
)
head(no_coll)
Reverts all changes to dynamic vars to the default values. For more details, refer to the dedicated vignette vignette("workflow_start", package="ISAnalytics").
reset_mandatory_IS_vars() resets the look-up table for mandatory IS vars.
reset_annotation_IS_vars() resets the look-up table for genomic annotation IS vars.
reset_af_columns_def() resets the look-up table for association file columns vars.
reset_iss_stats_specs() resets the look-up table for VISPA2 pool statistics vars.
reset_matrix_file_suffixes() resets the matrix file suffixes look-up table.
reset_dyn_vars_config() resets all look-up tables.
reset_mandatory_IS_vars()
reset_annotation_IS_vars()
reset_af_columns_def()
reset_iss_stats_specs()
reset_matrix_file_suffixes()
reset_dyn_vars_config()
NULL
Other dynamic vars: inspect_tags(), mandatory_IS_vars(), pcr_id_column(), set_mandatory_IS_vars(), set_matrix_file_suffixes()
reset_mandatory_IS_vars()
reset_annotation_IS_vars()
reset_af_columns_def()
reset_iss_stats_specs()
reset_matrix_file_suffixes()
reset_dyn_vars_config()
The function operates on a data frame by grouping the content by the sample key and computing every function specified on every column in the value_columns parameter. After that the metadata data frame is updated by including the computed results as columns for the corresponding key.
For this reason it's required that both x and metadata have the same sample key, and it's particularly important if the user is working with previously aggregated data.
For example:
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
aggreg_meta <- aggregate_metadata(association_file = association_file)
# Aggregated value columns carry the "_sum" suffix
sample_stats <- sample_statistics(
    x = aggreg,
    metadata = aggreg_meta,
    value_columns = c("seqCount_sum", "fragmentEstimate_sum"),
    sample_key = c("SubjectID", "CellMarker", "Tissue", "TimePoint")
)
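To make the grouping step more concrete, the sketch below reproduces the general idea in base R: each function in a named list is applied per sample key, and the results are attached to the metadata through the shared key. It is only an illustration on toy data; the naming of the output columns and the function list are assumptions, not the behaviour of default_stats().

```r
# Base R sketch of the grouping idea; not the ISAnalytics implementation.
x <- data.frame(
  CompleteAmplificationID = c("S1", "S1", "S2"),
  Value = c(10, 30, 5)
)
metadata <- data.frame(
  CompleteAmplificationID = c("S1", "S2"),
  Tissue = c("PB", "BM")
)

# Hypothetical named list of summary functions
funs <- list(sum = sum, mean = mean)

# One result column per function, computed for each sample key
stats <- lapply(names(funs), function(nm) {
  res <- aggregate(Value ~ CompleteAmplificationID, data = x, FUN = funs[[nm]])
  names(res)[2] <- paste0("Value_", nm)
  res
})
stats <- Reduce(
  function(a, b) merge(a, b, by = "CompleteAmplificationID"),
  stats
)

# Attach the per-sample results to the metadata using the shared key
metadata_with_stats <- merge(metadata, stats, by = "CompleteAmplificationID")
metadata_with_stats
```

The merge only works because both data frames expose the same key column, which is the requirement stated above.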
sample_statistics( x, metadata, sample_key = "CompleteAmplificationID", value_columns = "Value", functions = default_stats(), add_integrations_count = TRUE )
x |
A data frame |
metadata |
The metadata data frame |
sample_key |
Character vector representing the key for identifying a sample |
value_columns |
The name of the columns to be computed, must be numeric or integer |
functions |
A named list of function or purrr-style lambdas |
add_integrations_count |
Add the count of distinct integration sites
for each group? Can be computed only if |
A list with modified x and metadata data frames
The function will explicitly check for the presence of these tags:
All columns declared in mandatory_IS_vars()
These are checked only if add_integrations_count = TRUE.
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), top_integrations(), top_targeted_genes()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
stats <- sample_statistics(
    x = integration_matrices,
    metadata = association_file,
    value_columns = c("seqCount", "fragmentEstimate")
)
stats
The function separates a single multi-quantification integration matrix, obtained via comparison_matrix, into single quantification matrices as a named list of tibbles.
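The separation can be sketched in base R as follows: for each quantification column, the key columns are kept and the quantification column is renamed to a common name. The use of a single `Value` column in the output and the toy column names are assumptions made for illustration only.

```r
# Sketch only, in base R; the output layout (key columns plus a single
# `Value` column) is an assumption made for illustration.
multi <- data.frame(
  chr = c("1", "2"),
  integration_locus = c(100L, 200L),
  strand = c("+", "-"),
  CompleteAmplificationID = c("S1", "S2"),
  seqCount = c(50, 30),
  fragmentEstimate = c(12.5, 8.1)
)
key <- c("chr", "integration_locus", "strand", "CompleteAmplificationID")
quant_cols <- c("seqCount", "fragmentEstimate")

# One data frame per quantification type, each with key columns + Value
separated <- lapply(quant_cols, function(q) {
  out <- multi[c(key, q)]
  names(out)[names(out) == q] <- "Value"
  out
})
names(separated) <- quant_cols
names(separated)
```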
separate_quant_matrices( x, fragmentEstimate = "fragmentEstimate", seqCount = "seqCount", barcodeCount = "barcodeCount", cellCount = "cellCount", ShsCount = "ShsCount", key = c(mandatory_IS_vars(), annotation_IS_vars(), "CompleteAmplificationID") )
x |
Single integration matrix with multiple quantification value columns, obtained via comparison_matrix. |
fragmentEstimate |
Name of the fragment estimate values column in input |
seqCount |
Name of the sequence count values column in input |
barcodeCount |
Name of the barcode count values column in input |
cellCount |
Name of the cell count values column in input |
ShsCount |
Name of the shs count values column in input |
key |
Key columns to perform the joining operation |
A named list of data frames, where names are quantification types
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), transform_columns()
data("integration_matrices", package = "ISAnalytics")
separated <- separate_quant_matrices(
    integration_matrices
)
This set of functions allows users to specify custom look-up tables for dynamic variables. For more details, refer to the dedicated vignette vignette("workflow_start", package="ISAnalytics").
set_mandatory_IS_vars() sets the look-up table for mandatory IS vars.
set_annotation_IS_vars() sets the look-up table for genomic annotation IS vars.
set_af_columns_def() sets the look-up table for association file columns vars.
set_iss_stats_specs() sets the look-up table for VISPA2 pool statistics vars.
set_mandatory_IS_vars(specs)
set_annotation_IS_vars(specs)
set_af_columns_def(specs)
set_iss_stats_specs(specs)
specs |
Either a named vector or a data frame with specific format. See details. |
The user can supply specifications in the form of a named vector or a data frame.
When using a named vector, names should be the names of the columns, values should be the type associated with each column in the form of a string. The vector gets automatically converted into a data frame with the right format (default values for the columns transform and flag are NULL and required respectively). Use of this method is however discouraged: data frame inputs are preferred since they offer more control.
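The conversion from a named vector to a look-up table might look roughly like the base R sketch below. `to_lookup_table` is a hypothetical helper written for this illustration and is not part of ISAnalytics; the defaults for transform and flag follow the text above, while the default used for tag (NA) is an assumption.

```r
# Sketch of the conversion described above, in base R.
# `to_lookup_table` is a hypothetical helper, not part of ISAnalytics;
# the `tag` default used here (NA) is an assumption.
to_lookup_table <- function(specs) {
  tbl <- data.frame(
    names = names(specs),
    types = unname(specs),
    flag = "required",
    tag = NA_character_,
    stringsAsFactors = FALSE
  )
  # `transform` is a list column so each entry can hold NULL or a function
  tbl$transform <- rep(list(NULL), nrow(tbl))
  tbl[c("names", "types", "transform", "flag", "tag")]
}

specs <- c(chrom = "char", position = "int", strand = "char")
lookup <- to_lookup_table(specs)
lookup$names
```

A data frame built by hand gives control over every column, which is why it is the preferred input.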
The look-up table for dynamic vars should always follow this structure:
names | types | transform | flag | tag |
<name of the column> | <type> | <a lambda or NULL> | <flag> | <tag> |
where
names contains the name of the column as a character
types contains the type of the column. Type should be expressed as a string and should be in one of the allowed types
char for character (strings)
int for integers
logi for logical values (TRUE / FALSE)
numeric for numeric values
factor for factors
date for generic date format - note that functions that need to read and parse files will try to guess the format and parsing may fail
One of the accepted date/datetime formats by lubridate, you can use ISAnalytics::date_formats() to view the accepted formats
transform: a purrr-style lambda that is applied immediately after importing. This is useful to operate simple transformations like removing unwanted characters or rounding to a certain precision. Please note that these lambdas need to be functions that accept a vector as input and only operate a transformation, aka they output a vector of the same length as the input. For more complicated applications that may require the value of other columns, appropriate functions should be manually applied post-import.
flag: as of now, it should be set either to required or optional - some functions internally check for only required tags presence and if those are missing from inputs they fail, signaling failure to the user
tag: a specific tag expressed as a string
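The requirement on transform (a vector goes in, a vector of the same length comes out) can be illustrated with a plain R function. The sketch below mirrors the purrr-style lambda `~ stringr::str_replace_all(.x, "chr", "")` used in the examples for this page, rewritten with base R only; the function name is hypothetical.

```r
# A vectorised transform in plain R: one output element per input element.
# Base R equivalent of removing the "chr" prefix from chromosome names.
strip_chr <- function(x) gsub("chr", "", x, fixed = TRUE)

chrom <- c("chr1", "chrX", "chr12")
out <- strip_chr(chrom)
out

# The output has the same length as the input, as required
length(out) == length(chrom)
```

A function that summarises the column (for example `max`) would not qualify, because it returns a single value instead of one value per row.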
NULL
Other dynamic vars: inspect_tags(), mandatory_IS_vars(), pcr_id_column(), reset_mandatory_IS_vars(), set_matrix_file_suffixes()
tmp_mand_vars <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "chrom", "char", ~ stringr::str_replace_all(.x, "chr", ""), "required", "chromosome",
    "position", "int", NULL, "required", "locus",
    "strand", "char", NULL, "required", "is_strand",
    "gap", "int", NULL, "required", NA_character_,
    "junction", "int", NULL, "required", NA_character_
)
set_mandatory_IS_vars(tmp_mand_vars)
print(mandatory_IS_vars(TRUE))
reset_mandatory_IS_vars()

tmp_annot_vars <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "gene", "char", NULL, "required", "gene_symbol",
    "gene_strand", "char", NULL, "required", "gene_strand"
)
# Set the custom table before printing it, as in the other examples
set_annotation_IS_vars(tmp_annot_vars)
print(annotation_IS_vars(TRUE))
reset_annotation_IS_vars()

temp_af_cols <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "project", "char", NULL, "required", "project_id",
    "pcr_id", "char", NULL, "required", "pcr_repl_id",
    "subject", "char", NULL, "required", "subject"
)
set_af_columns_def(temp_af_cols)
print(association_file_columns(TRUE))
reset_af_columns_def()

tmp_iss_vars <- tibble::tribble(
    ~names, ~types, ~transform, ~flag, ~tag,
    "pool", "char", NULL, "required", "vispa_concatenate",
    "tag", "char", NULL, "required", "tag_seq",
    "barcode", "int", NULL, "required", NA_character_
)
set_iss_stats_specs(tmp_iss_vars)
iss_stats_specs(TRUE)
reset_iss_stats_specs()
The function automatically produces and sets a look-up table of matrix file suffixes based on user input.
set_matrix_file_suffixes( quantification_suffix = list(seqCount = "seqCount", fragmentEstimate = "fragmentEstimate", barcodeCount = "barcodeCount", cellCount = "cellCount", ShsCount = "ShsCount"), annotation_suffix = list(annotated = ".no0.annotated", not_annotated = ""), file_ext = "tsv.gz", glue_file_spec = "{quantification_suffix}_matrix{annotation_suffix}.{file_ext}" )
quantification_suffix |
A named list - names must be quantification
types in |
annotation_suffix |
A named list - names must be |
file_ext |
The file extension (e.g. |
glue_file_spec |
A string specifying the pattern used to form the
entire suffix, as per |
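To see how the default pattern "{quantification_suffix}_matrix{annotation_suffix}.{file_ext}" combines its three parts, the base R sketch below expands it by hand for two quantification types. It uses `sprintf()` instead of glue-style interpolation purely to stay dependency-free, so it illustrates the resulting strings rather than the way the function builds them.

```r
# Base R sketch of how the default pattern combines its parts.
# sprintf() stands in for the glue-style specification.
quantification_suffix <- c(
  seqCount = "seqCount",
  fragmentEstimate = "fragmentEstimate"
)
annotation_suffix <- c(annotated = ".no0.annotated", not_annotated = "")
file_ext <- "tsv.gz"

# Every combination of quantification type and annotation status
combos <- expand.grid(
  quantification = names(quantification_suffix),
  annotation = names(annotation_suffix),
  stringsAsFactors = FALSE
)

# "{quantification_suffix}_matrix{annotation_suffix}.{file_ext}"
combos$file_suffix <- sprintf(
  "%s_matrix%s.%s",
  quantification_suffix[combos$quantification],
  annotation_suffix[combos$annotation],
  file_ext
)
combos$file_suffix
```

With these inputs the annotated sequence count matrix would be matched by `seqCount_matrix.no0.annotated.tsv.gz` and the non-annotated one by `seqCount_matrix.tsv.gz`.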
NULL
Other dynamic vars: inspect_tags(), mandatory_IS_vars(), pcr_id_column(), reset_mandatory_IS_vars(), set_mandatory_IS_vars()
set_matrix_file_suffixes(
    quantification_suffix = list(
        seqCount = "sc",
        fragmentEstimate = "fe",
        barcodeCount = "barcodeCount",
        cellCount = "cellCount",
        ShsCount = "ShsCount"
    ),
    annotation_suffix = list(annotated = "annot", not_annotated = "")
)
matrix_file_suffixes()
reset_matrix_file_suffixes()
Displays the IS sharing calculated via is_sharing as heatmaps.
sharing_heatmap( sharing_df, show_on_x = "g1", show_on_y = "g2", absolute_sharing_col = "shared", title_annot = NULL, plot_relative_sharing = TRUE, rel_sharing_col = c("on_g1", "on_union"), show_perc_symbol_rel = TRUE, interactive = FALSE )
sharing_df |
The data frame containing the IS sharing data |
show_on_x |
Name of the column to plot on the x axis |
show_on_y |
Name of the column to plot on the y axis |
absolute_sharing_col |
Name of the column that contains the absolute values of IS sharing |
title_annot |
Additional text to display in the title |
plot_relative_sharing |
Logical. Compute heatmaps also for relative sharing? |
rel_sharing_col |
Names of the columns to consider as relative sharing. The function is going to plot one heatmap per column in this argument. |
show_perc_symbol_rel |
Logical. Only relevant if |
interactive |
Logical. The package plotly is required for this functionality. Returns the heatmaps as interactive HTML widgets. |
A list of plots or widgets
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), circos_genomic_density(), fisher_scatterplot(), integration_alluvial_plot(), sharing_venn(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
sharing <- is_sharing(aggreg,
    minimal = FALSE,
    include_self_comp = TRUE
)
sharing_heatmaps <- sharing_heatmap(sharing_df = sharing)
sharing_heatmaps$absolute
sharing_heatmaps$on_g1
sharing_heatmaps$on_union
This function processes a sharing data frame obtained via is_sharing() with the option table_for_venn = TRUE to obtain a list of objects that can be plotted as venn or euler diagrams.
sharing_venn(sharing_df, row_range = NULL, euler = TRUE)
sharing_df |
The sharing data frame |
row_range |
Either |
euler |
If |
The function requires the package eulerr. Each row of the input data frame is representable as a venn/euler diagram. The function allows specifying a range of row indexes to obtain a list of plottable objects all at once; leave it as NULL to process all rows.
To actually plot the data it is sufficient to call the function plot() and specify optional customization arguments. See eulerr docs for more detail on this.
A list of data frames
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), circos_genomic_density(), fisher_scatterplot(), integration_alluvial_plot(), sharing_heatmap(), top_abund_tableGrob(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
sharing <- is_sharing(aggreg, n_comp = 3, table_for_venn = TRUE)
venn_tbls <- sharing_venn(sharing, row_range = 1:3, euler = FALSE)
venn_tbls
plot(venn_tbls[[1]])
Produce summary tableGrobs as R graphics. For this functionality the suggested package gridExtra is required. To visualize the resulting object:
gridExtra::grid.arrange(tableGrob)
top_abund_tableGrob( df, id_cols = mandatory_IS_vars(), quant_col = "fragmentEstimate_sum_PercAbundance", by = "TimePoint", alluvial_plot = NULL, top_n = 10, tbl_cols = "GeneName", include_id_cols = FALSE, digits = 2, perc_symbol = TRUE, transform_by = NULL )
df |
A data frame |
id_cols |
Character vector of id column names. To plot after alluvial,
these columns must be the same as the |
quant_col |
Column name holding the quantification value.
To plot after alluvial,
these columns must be the same as the |
by |
The column name to subdivide tables for. The function
will produce one table for each distinct value in |
alluvial_plot |
Either NULL or an alluvial plot for color mapping between values of y. |
top_n |
Integer. How many rows should the table contain at most? |
tbl_cols |
Table columns to show in the final output besides
|
include_id_cols |
Logical. Include |
digits |
Integer. Digits to show for the quantification column |
perc_symbol |
Logical. Show percentage symbol in the quantification column? |
transform_by |
Either a function or a purrr-style lambda. This
function is applied to the column |
A tableGrob object
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), circos_genomic_density(), fisher_scatterplot(), integration_alluvial_plot(), sharing_heatmap(), sharing_venn(), top_cis_overtime_heatmap()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
abund <- compute_abundance(x = aggreg)
grob <- top_abund_tableGrob(abund)
gridExtra::grid.arrange(grob)
# with transform
grob <- top_abund_tableGrob(abund, transform_by = ~ as.numeric(.x))
This function visualizes the results of the function CIS_grubbs_overtime() as heatmaps of the top N selected genes over time.
top_cis_overtime_heatmap( x, n_genes = 20, timepoint_col = "TimePoint", group_col = "group", onco_db_file = "proto_oncogenes", tumor_suppressors_db_file = "tumor_suppressors", species = "human", known_onco = known_clinical_oncogenes(), suspicious_genes = clinical_relevant_suspicious_genes(), significance_threshold = 0.05, plot_values = c("minus_log_p", "p"), p_value_correction = c("fdr", "bonferroni"), prune_tp_treshold = 20, gene_selection_param = c("trimmed", "n", "mean", "sd", "median", "mad", "min", "max"), fill_0_selection = TRUE, fill_NA_in_heatmap = FALSE, heatmap_color_palette = "default", title_generator = NULL, save_as_files = FALSE, files_format = c("pdf", "png", "tiff", "bmp", "jpg"), folder_path = NULL, ... )
x |
Output of the function |
n_genes |
Number of top genes to consider |
timepoint_col |
The name of the time point column in |
group_col |
The name of the group column in |
onco_db_file |
Uniprot file for proto-oncogenes (see details). If different from default, should be supplied as a path to a file. |
tumor_suppressors_db_file |
Uniprot file for tumor-suppressor genes. If different from default, should be supplied as a path to a file. |
species |
One between |
known_onco |
Data frame with known oncogenes. See details. |
suspicious_genes |
Data frame with clinical relevant suspicious genes. See details. |
significance_threshold |
The significance threshold |
plot_values |
Which kind of values should be plotted? Can either be
|
p_value_correction |
One among |
prune_tp_treshold |
Minimum number of genes to retain a time point. See details. |
gene_selection_param |
The descriptive statistic measure to decide
which genes to plot, possible choices are
|
fill_0_selection |
Fill NA values with 0s before computing statistics for each gene? (TRUE/FALSE) |
fill_NA_in_heatmap |
Fill NA values with 0 when plotting the heatmap? (TRUE/FALSE) |
heatmap_color_palette |
Colors for values in the heatmaps,
either |
title_generator |
Either |
save_as_files |
Should heatmaps be saved to files on disk? (TRUE/FALSE) |
files_format |
The extension of the files produced, supported
formats are |
folder_path |
Path to the folder where files will be saved |
... |
Other params to pass to |
These files are included in the package for user convenience and are simply UniProt files with gene annotations for human and mouse. For more details on how these files were generated use the help ?tumor_suppressors, ?proto_oncogenes
The default values are included in this package and can be accessed with:
known_clinical_oncogenes()
If the user wants to change this parameter the input data frame must preserve the column structure. The same goes for the suspicious_genes parameter (DOIReference column is optional):
clinical_relevant_suspicious_genes()
Since the genes present in different time point slices are likely different, the decision process to select the final top N genes to represent in the heatmap follows this logic:
Each time point slice is arranged either in ascending order (if we want to plot the p-value) or in descending order (if we want to plot the scaled p-value) and the top n genes are selected
A series of statistics are computed over the union set of genes on ALL time points (min, max, mean, ...)
A decision is taken by considering the ordered gene_selection_param (order depends once again on whether the values are scaled or not), and the first N genes are selected for plotting.
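The three steps above can be sketched in base R for the scaled p-value case (higher values are more significant). The data, column names and the use of the mean as the selection statistic are all made up for illustration; this is not the code used by the function.

```r
# Base R sketch of the three-step selection for scaled p-values.
# Data and names are made up for illustration.
cis <- data.frame(
  GeneName = c("A", "B", "C", "A", "C", "D"),
  TimePoint = c(1, 1, 1, 2, 2, 2),
  minus_log_p = c(5, 1, 3, 4, 2, 6)
)
n_per_tp <- 2 # genes kept in each time point slice
n_final <- 2 # genes shown in the heatmap

# Step 1: top genes within each time point (descending for scaled values)
top_per_tp <- lapply(split(cis, cis$TimePoint), function(slice) {
  head(slice$GeneName[order(slice$minus_log_p, decreasing = TRUE)], n_per_tp)
})

# Step 2: statistic over the union of selected genes, across all time points
candidates <- unique(unlist(top_per_tp))
pool <- cis[cis$GeneName %in% candidates, ]
gene_stat <- tapply(pool$minus_log_p, pool$GeneName, mean)

# Step 3: order by the chosen statistic and keep the first N genes
selected <- names(gene_stat)[order(gene_stat, decreasing = TRUE)]
selected <- selected[seq_len(n_final)]
selected
```

Gene B never makes it into a per-time-point top list, so it is excluded before the statistics are computed.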
It is possible to fill NA values (aka missing combinations of GENE/TP) with 0s prior to computing the descriptive statistics on which gene selection is based. Please keep in mind that this has an impact on the final result, since for computing metrics such as the mean, NA values are usually removed, decreasing the overall number of values considered - this does not hold when NA values are substituted with 0s.
Statistics are computed for each gene over all time points of each group. More in detail, n counts the number of instances (rows) in which the gene appears, aka it counts the time points in which the gene is present. NOTE: if the fill_0_selection option is set to TRUE this value will be equal for all genes! All other statistics as per the argument gene_selection_param map to the corresponding R functions with the exception of trimmed, which is a simple call to the mean function with the argument trim = 0.1.
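The effect of the 0-fill on these statistics is easy to check on a toy vector. The sketch below uses base R only and made-up values; note that in base R the trimming fraction of `mean()` is passed through the `trim` argument.

```r
# Base R illustration of the statistics discussed above (values are made up).
p_scaled <- c(2, NA, 4, 10) # one gene, four time points; NA = gene absent

# NA values removed: the mean is taken over 3 observations
mean_na_rm <- mean(p_scaled, na.rm = TRUE)

# NA values replaced with 0: the mean is taken over all 4 time points
filled <- ifelse(is.na(p_scaled), 0, p_scaled)
mean_filled <- mean(filled)

# `n`: number of time points in which the gene is present
n_present <- sum(!is.na(p_scaled))
# After the fill every gene has a value at every time point
n_after_fill <- sum(!is.na(filled))

# Trimmed mean: with 10 values and trim = 0.1, the lowest and the highest
# value are dropped before averaging
trimmed <- mean(c(1, 2, 3, 4, 100, 5, 6, 7, 8, 9), trim = 0.1)

c(mean_na_rm = mean_na_rm, mean_filled = mean_filled, trimmed = trimmed)
```

The filled mean is lower than the NA-removed mean, and `n` stops discriminating between genes once the fill is applied.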
It is possible to customise the appearance of the plot through different parameters.
fill_NA_in_heatmap tells the function whether missing combinations of GENE/TP should be plotted as NA or filled with a value (1 if p-value, 0 if scaled p-value).
A title generator function can be provided to dynamically create a title for the plots: the function can accept two positional arguments for the group identifier and the number of selected genes respectively. If one or none of the arguments are of interest, they can be absorbed with `...`.
heatmap_color_palette can be used to specify a function from which colors are sampled (refers to the colors of values only).
To change the colors associated with annotations instead, use the argument annotation_colors of pheatmap::pheatmap() - it must be set to a list with this format:
list(
    KnownGeneClass = c("OncoGene" = color_spec, "Other" = color_spec, "TumSuppressor" = color_spec),
    ClinicalRelevance = c("TRUE" = color_spec, "FALSE" = color_spec),
    CriticalForInsMut = c("TRUE" = color_spec, "FALSE" = color_spec)
)
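As an illustration, the sketch below shows two possible title generators and a filled-in annotation color list. The function names, title texts and color choices are hypothetical; only the argument positions and the list structure come from the description above.

```r
# Hypothetical title generator using both positional arguments
title_both <- function(group_id, n_genes) {
  paste0("Top ", n_genes, " genes over time - ", group_id)
}

# Uses only the group identifier; the second argument is absorbed by `...`
title_group_only <- function(group_id, ...) {
  paste("CIS over time -", group_id)
}

title_both("PT001", 20)
title_group_only("PT001", 20)

# Annotation colors in the required format (color names are arbitrary picks)
annotation_colors <- list(
  KnownGeneClass = c(
    "OncoGene" = "red2", "Other" = "grey70", "TumSuppressor" = "dodgerblue4"
  ),
  ClinicalRelevance = c("TRUE" = "gold", "FALSE" = "grey90"),
  CriticalForInsMut = c("TRUE" = "red2", "FALSE" = "grey90")
)
```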
Either a list of graphical objects or a list of paths where plots were saved
Other Plotting functions: CIS_volcano_plot(), HSC_population_plot(), circos_genomic_density(), fisher_scatterplot(), integration_alluvial_plot(), sharing_heatmap(), sharing_venn(), top_abund_tableGrob()
data("integration_matrices", package = "ISAnalytics")
data("association_file", package = "ISAnalytics")
aggreg <- aggregate_values_by_key(
    x = integration_matrices,
    association_file = association_file,
    value_cols = c("seqCount", "fragmentEstimate")
)
cis_overtime <- CIS_grubbs_overtime(aggreg)
hmaps <- top_cis_overtime_heatmap(cis_overtime$cis,
    fill_NA_in_heatmap = TRUE
)
# To re-plot:
# grid::grid.newpage()
# grid::grid.draw(hmaps$PT001$gtable)
The input data frame will be sorted by the highest values in the columns specified and the top n rows will be returned as output. The user can choose to keep additional columns in the output by passing a vector of column names or by passing one of 2 "shortcuts":
keep = "everything" keeps all columns in the original data frame
keep = "nothing" only keeps the mandatory columns (mandatory_IS_vars()) plus the columns in the columns parameter.
top_integrations(
    x,
    n = 20,
    columns = "fragmentEstimate_sum_RelAbundance",
    keep = "everything",
    key = NULL
)
x |
An integration matrix (data frame containing
|
n |
How many integrations should be sliced (in total or for each group)? Must be numeric or integer and greater than 0 |
columns |
Columns to use for the sorting. If more than one column is supplied, primary ordering is done on the first column and secondary ordering on all other columns |
keep |
Names of the columns to keep besides |
key |
Either |
Either a data frame with at most n rows or a data frame with at most n*(number of groups) rows.
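The "at most" in the returned size can be seen with a base R sketch of the sort-and-slice idea. This is an illustration on toy data, not the code of top_integrations(); the grouping column stands in for the key argument.

```r
smpl <- data.frame(
  chr = c("1", "2", "3", "4", "5", "6"),
  integration_locus = c(14536, 14544, 14512, 14236, 14522, 14566),
  strand = c("+", "+", "-", "+", "-", "+"),
  CompleteAmplificationID = c("ID1", "ID2", "ID1", "ID1", "ID3", "ID2"),
  Value = c(3, 10, 40, 2, 15, 150),
  stringsAsFactors = FALSE
)
n <- 2

# No key: sort by decreasing Value and keep the first n rows overall
top_overall <- head(smpl[order(-smpl$Value), ], n)

# With a key: repeat the same operation inside each group, then stack
by_group <- split(smpl, smpl$CompleteAmplificationID)
top_by_group <- do.call(rbind, lapply(by_group, function(g) {
  head(g[order(-g$Value), ], n)
}))

nrow(top_overall)  # n rows
nrow(top_by_group) # fewer than n * 3 groups, since ID3 has a single row
```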
The function will explicitly check for the presence of these tags:
All columns declared in mandatory_IS_vars()
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_targeted_genes()
smpl <- tibble::tibble(
    chr = c("1", "2", "3", "4", "5", "6"),
    integration_locus = c(14536, 14544, 14512, 14236, 14522, 14566),
    strand = c("+", "+", "-", "+", "-", "+"),
    CompleteAmplificationID = c("ID1", "ID2", "ID1", "ID1", "ID3", "ID2"),
    Value = c(3, 10, 40, 2, 15, 150),
    Value2 = c(456, 87, 87, 9, 64, 96),
    Value3 = c("a", "b", "c", "d", "e", "f")
)
top <- top_integrations(smpl,
    n = 3,
    columns = c("Value", "Value2"),
    keep = "nothing"
)
top_key <- top_integrations(smpl,
    n = 3,
    columns = "Value",
    keep = "Value2",
    key = "CompleteAmplificationID"
)
Produces a summary of the number of integration events per gene, orders the table in decreasing order and slices the first n rows - either on all the data frame or by group.
top_targeted_genes(
    x,
    n = 20,
    key = c("SubjectID", "CellMarker", "Tissue", "TimePoint"),
    consider_chr = TRUE,
    consider_gene_strand = TRUE,
    as_df = TRUE
)
x |
An integration matrix - must be annotated |
n |
Number of rows to slice |
key |
If slice has to be performed for each group, the character
vector of column names that identify the groups. If |
consider_chr |
Logical, should the chromosome be taken into account? See details. |
consider_gene_strand |
Logical, should the gene strand be taken into account? See details. |
as_df |
If computation is performed by group, |
When producing a summary of IS by gene, there are different options that can be chosen.
The argument consider_chr accounts for the fact that some genes (same gene symbol) may span more than one chromosome: if set to TRUE, counts of IS will be separated for those genes that span 2 or more chromosomes - in other words they will be in 2 different rows of the output table. On the contrary, if the argument is set to FALSE, counts will be produced in a single row.
NOTE: the function counts DISTINCT integration events, which logically corresponds to a union of sets. Be aware of the fact that counts per group and counts with different arguments might be different: if for example counts are performed by considering chromosome and there is one gene symbol with 2 different counts, the sum of those 2 will likely not be equal to the count obtained by performing the calculations without considering the chromosome.
The same reasoning can be applied for the argument consider_gene_strand, that takes into account the strand of the gene.
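The union-of-sets remark in the note can be illustrated with a base R sketch comparing per-group counts with the overall count. The toy table, its column names and the helper count_distinct() are assumptions for the example; here an integration event is taken to be distinct on chromosome, locus and strand.

```r
# Toy annotated matrix: the IS at chr 1, locus 100, strand + is shared
# by two subjects
is_table <- data.frame(
  chr = c("1", "1", "1", "2"),
  integration_locus = c(100, 100, 250, 300),
  strand = c("+", "+", "-", "+"),
  GeneName = c("G1", "G1", "G1", "G2"),
  SubjectID = c("PT001", "PT002", "PT002", "PT001"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of DISTINCT integration events in a table
count_distinct <- function(df) {
  nrow(unique(df[, c("chr", "integration_locus", "strand")]))
}

g1 <- is_table[is_table$GeneName == "G1", ]
overall_g1 <- count_distinct(g1)
per_subject_g1 <- vapply(
  split(g1, g1$SubjectID), count_distinct, integer(1)
)

overall_g1          # the shared IS is counted once
sum(per_subject_g1) # the shared IS is counted once per subject
```

Summing the per-group counts therefore overestimates the overall count whenever an integration event appears in more than one group.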
A data frame or a list of data frames
The function will explicitly check for the presence of these tags:
chromosome
locus
gene_symbol
gene_strand
Note that the tags "gene_strand" and "chromosome" are explicitly required only if consider_chr = TRUE and/or consider_gene_strand = TRUE.
Other Analysis functions: CIS_grubbs(), HSC_population_size_estimate(), compute_abundance(), cumulative_is(), gene_frequency_fisher(), is_sharing(), iss_source(), sample_statistics(), top_integrations()
data("integration_matrices", package = "ISAnalytics")
top_targ <- top_targeted_genes(
    integration_matrices,
    key = NULL
)
top_targ
This function takes a named list of purrr-style lambdas where names are the names of the columns in the data frame that must be transformed. NOTE: the columns are overridden, not appended.
transform_columns(df, transf_list)
df |
The data frame on which transformations should be operated |
transf_list |
A named list of purrr-style lambdas, where names are column names the function should be applied to. |
Lambdas provided in input must be transformations, aka functions that take in input a vector and return a vector of the same length as the input.
If the input transformation list contains column names that are not present in the input data frame, they are simply ignored.
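The two rules above (length-preserving transformations, unknown column names ignored) can be sketched in base R. The helper apply_transformations() is hypothetical and uses plain functions instead of purrr-style lambdas to stay dependency-free; it is not the implementation of transform_columns().

```r
# Hypothetical helper mirroring the documented behaviour
apply_transformations <- function(df, transf_list) {
  # Keep only transformations whose name matches an existing column
  usable <- transf_list[names(transf_list) %in% names(df)]
  for (col in names(usable)) {
    new_values <- usable[[col]](df[[col]])
    # A transformation must return a vector as long as its input
    stopifnot(length(new_values) == nrow(df))
    df[[col]] <- new_values # the column is overridden, not appended
  }
  df
}

df <- data.frame(
  A = c(1, 3, 5),
  C = c("a", "b", "c"),
  stringsAsFactors = FALSE
)
out <- apply_transformations(df, list(
  A = function(x) x + 1,
  C = toupper,
  Z = function(x) x * 2 # no column Z in df: silently ignored
))
out
```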
A data frame with transformed columns
Other Utilities: as_sparse_matrix(), comparison_matrix(), enable_progress_bars(), export_ISA_settings(), generate_Vispa2_launch_AF(), generate_blank_association_file(), generate_default_folder_structure(), import_ISA_settings(), separate_quant_matrices()
df <- tibble::tribble(
    ~A, ~B, ~C, ~D,
    1, 2, "a", "aa",
    3, 4, "b", "bb",
    5, 6, "c", "cc"
)
lambdas <- list(A = ~ .x + 1, B = ~ .x + 2, C = ~ stringr::str_to_upper(.x))
transform_columns(df, lambdas)