Title: | High Performance Data Integration for Large-Scale Analyses of Incomplete Omic Profiles Using Batch-Effect Reduction Trees (BERT) |
---|---|
Description: | Provides efficient batch-effect adjustment of data with missing values. BERT orders all batch effect correction to a tree of pairwise computations. BERT allows parallelization over sub-trees. |
Authors: | Yannis Schumann [aut, cre] , Simon Schlumbohm [aut] |
Maintainer: | Yannis Schumann <[email protected]> |
License: | GPL-3 |
Version: | 1.3.3 |
Built: | 2024-11-21 03:03:23 UTC |
Source: | https://github.com/bioc/BERT |
This function is called by the BERT algorithm and should not be called by the user directly.
adjust_node(data, b1, b2, mod, combatmode, method)
data |
Matrix or dataframe in the format (samples, features). Additional column names are "Batch", "Cov_X" (where X may be any number), "Label" and "Sample". |
b1 |
The first batch to adjust. |
b2 |
The second batch to adjust. |
mod |
Dataframe with potential covariables to use. May be empty. |
combatmode |
Integer encoding the parameters to use for ComBat. 1 (default): par.prior = TRUE, mean.only = FALSE; 2: par.prior = TRUE, mean.only = TRUE; 3: par.prior = FALSE, mean.only = FALSE; 4: par.prior = FALSE, mean.only = TRUE. Ignored if method == "limma". |
method |
Adjustment method to use. Should either be "ComBat" or "limma". "None" is also allowed for testing purposes and will yield no batch effect correction. |
A matrix/dataframe mirroring the shape of the input. The data will be batch-effect adjusted by the specified method.
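The combatmode encoding can be read as a small lookup table. The following is a minimal sketch of that mapping in base R; the helper name decode_combatmode is hypothetical and not part of BERT, and the package may decode the integer differently internally.

```r
# Hypothetical helper (not part of BERT): translate the documented
# combatmode integer into the two ComBat flags it stands for.
decode_combatmode <- function(combatmode = 1) {
  modes <- list(
    list(par.prior = TRUE,  mean.only = FALSE),  # 1 (default)
    list(par.prior = TRUE,  mean.only = TRUE),   # 2
    list(par.prior = FALSE, mean.only = FALSE),  # 3
    list(par.prior = FALSE, mean.only = TRUE)    # 4
  )
  # reject anything that is not a single value from 1 to 4
  if (!(length(combatmode) == 1 && combatmode %in% 1:4)) {
    stop("combatmode must be one of 1, 2, 3 or 4")
  }
  modes[[combatmode]]
}

str(decode_combatmode(2))
```

The returned list uses the same names as the documented flags, so it could be inspected before choosing a mode.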
This function uses ComBat or limma to adjust an entire hierarchy level.
adjustment_step(data, mod, combatmode, method)
data |
Matrix or dataframe in the format (samples, features). Additional column names are "Batch", "Cov_X" (where X may be any number), "Label" and "Sample". |
mod |
Dataframe with potential covariables to use. May be empty. |
combatmode |
Integer encoding the parameters to use for ComBat. 1 (default): par.prior = TRUE, mean.only = FALSE; 2: par.prior = TRUE, mean.only = TRUE; 3: par.prior = FALSE, mean.only = FALSE; 4: par.prior = FALSE, mean.only = TRUE. Ignored if method == "limma". |
method |
Adjustment method to use. Should either be "ComBat" or "limma". |
A matrix/dataframe mirroring the shape of the input. The data will be batch-effect adjusted by BERT.
This function uses the hierarchical BERT algorithm to adjust data with batch effects. It assumes that the data is in the format (samples, features) and that missing values are indicated by NA. An additional column labelled "Batch" should indicate the batch. Furthermore, all columns named "Cov_1", "Cov_2", ... will be considered as covariates for adjustment. Columns labelled "Label" and "Sample" will be ignored; all other columns are assumed to contain data.
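To make the expected layout concrete, here is a small base-R sketch of an input table in the (samples, features) format. All values and the feature names F1 to F3 are made up for illustration, and the column selection at the end only mirrors the description above; it is not BERT's own code.

```r
# Toy input in the (samples, features) layout; all values are made up.
set.seed(1)
toy <- data.frame(
  Sample = paste0("S", 1:6),
  Batch  = rep(c(1, 2), each = 3),
  Label  = rep(c(1, 2, 1), times = 2),
  Cov_1  = c(1, 1, 2, 2, 1, 2),
  F1     = rnorm(6),
  F2     = rnorm(6),
  F3     = rnorm(6)
)
toy$F2[c(2, 5)] <- NA  # missing values are encoded as NA

# Columns that the description treats as metadata rather than data.
is_meta <- names(toy) %in% c("Batch", "Label", "Sample") | grepl("^Cov_", names(toy))
feature_cols <- names(toy)[!is_meta]
feature_cols
```

Every column that is not metadata under this reading (here F1, F2 and F3) would be treated as a feature.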
BERT( data, cores = NULL, combatmode = 1, corereduction = 4, stopParBatches = 2, backend = "default", method = "ComBat", qualitycontrol = TRUE, verify = TRUE, labelname = "Label", batchname = "Batch", referencename = "Reference", samplename = "Sample", covariatename = NULL, BPPARAM = NULL, assayname = NULL )
data |
Matrix/dataframe/SummarizedExperiment in the format (samples, features). Additional column names are "Batch", "Cov_X" (where X may be any number), "Label", "Sample" and "Reference". Must contain at least two features. |
cores |
The number of cores to use for parallel adjustment. Increasing this number leads to faster adjustment, especially on Linux machines. The default is NULL, in which case the BiocParallel::bpparam() backend will be used. If an integer is given, a backend with the corresponding number of workers will be created and registered as default for usage. |
combatmode |
Integer encoding the parameters to use for ComBat. 1 (default): par.prior = TRUE, mean.only = FALSE; 2: par.prior = TRUE, mean.only = TRUE; 3: par.prior = FALSE, mean.only = FALSE; 4: par.prior = FALSE, mean.only = TRUE. Ignored if method != "ComBat". |
corereduction |
Reducing the number of workers by at least this number. Only used if cores is an integer. |
stopParBatches |
The minimum number of batches required at a hierarchy level to proceed with parallelized adjustment. If the number of batches is smaller, adjustment will be performed sequentially to avoid overheads. |
backend |
The backend to choose for communicating the data. Valid choices are "default" and "file". The latter will use temp files for communicating data chunks between the processes. |
method |
Adjustment method to use. Should either be "ComBat", "limma" or "ref". Also allows "None" for testing purposes, which will perform no batch-effect adjustment. |
qualitycontrol |
Boolean indicating whether average silhouette widths (ASWs) should be computed before and after batch effect adjustment. If TRUE, will compute the ASW with respect to the "Batch" and "Label" columns (if they exist). |
verify |
Whether the input matrix/dataframe needs to be verified before adjustment (faster if FALSE). |
labelname |
A string containing the name of the column to use as class labels. The default is "Label". |
batchname |
A string containing the name of the column to use as batch labels. The default is "Batch". |
referencename |
A string containing the name of the column to use as reference labels. The default is "Reference". |
samplename |
A string containing the name of the column to use as sample name. The default is "Sample". |
covariatename |
A vector containing the names of columns with categorical covariables. The default is NULL, for which all columns with the pattern "Cov" will be selected. |
BPPARAM |
An instance of BiocParallelParam that will be used for parallelization. The default is NULL, in which case the value of cores determines the behaviour of BERT. |
assayname |
User-defined string that specifies which assay to select if the input data is a SummarizedExperiment. The default is NULL. |
A matrix/dataframe/SummarizedExperiment mirroring the shape of the input. The data will be batch-effect adjusted by BERT.
# generate dataset with 1000 features, 5 batches, 10 samples per batch and
# two genotypes
data = generate_dataset(1000,5,10,0.1, 2)
corrected = BERT(data, cores=2)
Chunks data into n segments with (close-to) equivalent number of batches and stores them in temporary RDS files
chunk_data(data, n, backend = "default")
data |
Dataframe with the data to adjust |
n |
The number of chunks to create |
backend |
The backend to choose for communicating the data. Valid choices are "default" and "file". The latter will use temp files for communicating data chunks between the processes. |
Vector with the absolute paths to the temporary files, where the data is stored
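The chunking idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name split_batches is hypothetical, batches are grouped contiguously so that chunk sizes differ by at most one batch, and each chunk is written with saveRDS to a tempfile. How BERT actually assigns batches to chunks is not specified here.

```r
# Hypothetical helper (not part of BERT): split a data frame into n chunks
# that each hold a near-equal number of batches.
split_batches <- function(data, n) {
  batches <- unique(data$Batch)
  # contiguous groups of batches whose sizes differ by at most one
  group <- ceiling(seq_along(batches) * n / length(batches))
  lapply(split(batches, group), function(b) data[data$Batch %in% b, , drop = FALSE])
}

# Toy data: 5 batches with 2 samples each.
toy <- data.frame(Batch = rep(1:5, each = 2), F1 = seq_len(10))
chunks <- split_batches(toy, 2)

# Store each chunk in a temporary RDS file and keep the file paths.
paths <- vapply(chunks, function(chunk) {
  path <- tempfile(fileext = ".rds")
  saveRDS(chunk, path)
  path
}, character(1))

vapply(paths, function(p) nrow(readRDS(p)), integer(1))
```

With five batches and two chunks, this sketch puts two batches in the first file and three in the second. If n exceeds the number of batches, fewer than n chunks are produced.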
Columns labelled Batch, Sample, Label, Reference and Cov_1 will be ignored.
compute_asw(dataset)
dataset |
Dataframe in the shape (samples, features) with additional columns Batch and Label. |
List with fields "Label" and "Batch" for the ASW with regards to Label and Batch respectively.
# generate dataset with 1000 features, 5 batches, 10 samples per batch and
# two genotypes
data = generate_dataset(1000,5,10,0.1, 2)
asw = compute_asw(data)
asw
Count the number of numeric features in this dataset. Columns labeled "Batch", "Sample" or "Label" will be ignored.
count_existing(dataset)
dataset |
Dataframe in the shape (samples, features) with optional columns "Batch", "Sample" or "Label". |
Integer indicating the number of numeric values
# generate dataset with 1000 features, 5 batches, 10 samples per batch and
# two genotypes
data = generate_dataset(1000,5,10, 0.1, 2)
count_existing(data)
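For intuition, a count of this kind can be sketched in base R. The helper below is hypothetical and assumes that "numeric values" means non-missing entries in all columns other than "Batch", "Sample" and "Label"; the package's own implementation may differ.

```r
# Hypothetical helper (not part of BERT): count non-missing entries in the
# feature columns, ignoring the metadata columns named in the description.
count_numeric_values <- function(dataset) {
  meta <- c("Batch", "Sample", "Label")
  values <- dataset[, !(names(dataset) %in% meta), drop = FALSE]
  sum(!is.na(values))
}

toy <- data.frame(
  Batch  = c(1, 1, 2),
  Sample = c("a", "b", "c"),
  F1     = c(0.5, NA, 1.2),
  F2     = c(NA, NA, 3.1)
)
count_numeric_values(toy)  # 3: two values in F1 and one in F2
```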
This function is called automatically by BERT. It removes empty columns and removes a (usually very small) number of numeric values, if features are unadjustable for lack of data.
format_DF( data, labelname = "Label", batchname = "Batch", referencename = "Reference", samplename = "Sample", covariatename = NULL, assayname = NULL )
data |
Matrix or dataframe in the format (samples, features). |
labelname |
A string containing the name of the column to use as class labels. The default is "Label". |
batchname |
A string containing the name of the column to use as batch labels. The default is "Batch". |
referencename |
A string containing the name of the column to use as reference labels. The default is "Reference". |
samplename |
A string containing the name of the column to use as sample name. The default is "Sample". |
covariatename |
A vector containing the names of columns with categorical covariables. The default is NULL, for which all columns with the pattern "Cov" will be selected. Additional column names are "Batch", "Cov_X" (where X may be any number), "Label" and "Sample". |
assayname |
User-defined string that specifies which assay to select if the input data is a SummarizedExperiment. The default is NULL. |
The formatted matrix.
The data will already be in the correct format for BERT.
generate_data_covariables( features, batches, samplesperbatch, mvstmt, imbalcov, housekeeping = NULL )
features |
Integer indicating the number of features (e.g. genes/proteins) in the dataset. |
batches |
Integer indicating the number of batches in the dataset. |
samplesperbatch |
Integer indicating the number of samples per batch. |
mvstmt |
Float (in [0,1)) indicating the fraction of missing values per batch. |
imbalcov |
Float indicating the probability for one of the classes to be drawn as class label for each sample. The second class will have a probability of 1 - imbalcov. |
housekeeping |
If NULL, no housekeeping features will be simulated. Otherwise, housekeeping indicates the fraction of housekeeping features. |
A dataframe containing the simulated data. Column Cov_1 will contain the simulated, imbalanced labels.
# generate dataset with 1000 features, 5 batches, 10 samples per batch and
# two genotypes. The class ratio will either be 7:3 or 3:7 per batch.
data = generate_data_covariables(1000,5,10, 0.1, 0.3)
The data will already be in the correct format for BERT.
generate_dataset( features, batches, samplesperbatch, mvstmt, classes, housekeeping = NULL, deterministic = FALSE )
features |
Integer indicating the number of features (e.g. genes/proteins) in the dataset. |
batches |
Integer indicating the number of batches in the dataset. |
samplesperbatch |
Integer indicating the number of samples per batch. |
mvstmt |
Float (in [0,1)) indicating the fraction of missing values per batch. |
classes |
Integer indicating the number of classes in the dataset. |
housekeeping |
If NULL, no housekeeping features will be simulated. Otherwise, housekeeping indicates the fraction of housekeeping features. |
deterministic |
Whether to assign the classes deterministically instead of by random sampling. |
A dataframe containing the simulated data.
# generate dataset with 1000 features, 5 batches, 10 samples per batch and
# two genotypes
data = generate_dataset(1000,5,10, 0.1, 2)
This function will be called automatically by BERT on data from each batch independently.
get_adjustable_features(data_batch)
data_batch |
Matrix or dataframe in the format (samples, features). Additional column names are "Batch", "Cov_X" (where X may be any number), "Label", "Reference" and "Sample". |
A logical with TRUE for adjustable features and FALSE for features with too many missing values.
This function will be called automatically by BERT on data from each batch independently.
get_adjustable_features_with_mod(data_batch, mod_batch)
data_batch |
Matrix or dataframe in the format (samples, features). Additional column names are "Batch", "Cov_X" (where X may be any number), "Label" and "Sample". |
mod_batch |
Matrix or dataframe in the format (samples, covariates). Contains only the covariate columns. |
A logical with TRUE for adjustable features and FALSE for features with too many missing values.
Identifies the adjustable features using only the references. Similar to the function in adjust_features.R but with different arguments
identify_adjustableFeatures_refs(x, batch, idx)
x |
the data matrix |
batch |
the list with the batches |
idx |
the vector indicating whether the respective sample is to be used as references |
vector indicating whether each feature can be adjusted
Identifies the references to use for this specific batch effect adjustment
identify_references(batch, references)
batch |
vector of batch numbers. Must contain 2 unique elements |
references |
vector that contains 0 if the sample is to be co-adjusted and a class otherwise |
the indices of the reference samples
This function is usually called by BERT during formatting of the input. The idea is that Label, Batch and covariables should only be integers.
ordinal_encode(column)
column |
The categorical vector |
The encoded vector
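One common way to turn a categorical vector into integers is via factor codes. The sketch below illustrates that technique in base R; the helper name encode_ordinal and the choice to number categories in order of first appearance are assumptions for illustration, not a statement about how ordinal_encode assigns its codes.

```r
# Hypothetical helper (not part of BERT): map each distinct category to an
# integer code, numbered in order of first appearance.
encode_ordinal <- function(column) {
  as.integer(factor(column, levels = unique(column)))
}

encode_ordinal(c("tumor", "normal", "tumor", "control"))  # 1 2 1 3
```

Equal categories always receive the same code, which is the property that matters when the encoded column is later used as a label, batch or covariable.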
Adjusts all chunks of data (in parallel) as far as possible.
parallel_bert( chunks, BPPARAM = BiocParallel::bpparam(), method = "ComBat", combatmode = 1, backend = "default" )
chunks |
vector with the filenames to the temp files where the sub-matrices are stored |
BPPARAM |
The BiocParallel backend to use. The default is the currently registered backend. |
method |
the BE-correction method to use. Possible choices are ComBat and limma |
combatmode |
The mode to use for ComBat (ignored if method is limma). Encoded options are the same as for HarmonizR. |
backend |
The backend to choose for communicating the data. Valid choices are "default" and "file". The latter will use temp files for communicating data chunks between the processes. |
dataframe with the adjusted matrix
A method to remove batch effects estimated from a subset (references) per batch only. Source code is heavily based on limma::removeBatchEffect by Gordon Smyth and Carolyn de Graaf.
removeBatchEffectRefs(x, batch, references)
x |
the data matrix with samples in columns and features in rows |
batch |
the batch list as vector. |
references |
a vector of integers, indicating whether the corresponding sample is to be co-adjusted (0) or may be used as a reference (>0) |
the corrected data matrix
Replaces missing values (NaN) by NA; this appears to be faster.
replace_missing(data)
data |
The data as dataframe |
The data with the replaced MVs
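The replacement of NaN by NA can be illustrated in base R. The helper below is a hypothetical sketch rather than the package's implementation; it works column by column because is.nan has no method for whole data frames.

```r
# Hypothetical helper (not part of BERT): replace NaN by NA in every
# numeric column of a data frame, leaving other columns untouched.
nan_to_na <- function(data) {
  data[] <- lapply(data, function(col) {
    if (is.numeric(col)) col[is.nan(col)] <- NA
    col
  })
  data
}

toy <- data.frame(Sample = c("a", "b"), F1 = c(NaN, 1), F2 = c(2, NA))
cleaned <- nan_to_na(toy)
is.nan(cleaned$F1)  # no NaN entries remain
```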
Strip column labelled Cov_1 from dataframe.
strip_Covariable(dataset)
dataset |
Dataframe in the shape (samples, features) with additional column Cov_1 |
Dataset without column Cov_1.
Verifies that the input to BERT is valid.
validate_bert_input( data, cores, combatmode, corereduction, stopParBatches, backend, method, qualitycontrol, verify, labelname, batchname, referencename, samplename, covariatename, assayname )
data |
Matrix/dataframe/SummarizedExperiment in the format (samples, features). Additional column names are "Batch", "Cov_X" (where X may be any number), "Label", "Sample" and "Reference". Must contain at least two features. |
cores |
The number of cores to use for parallel adjustment. Increasing this number leads to faster adjustment, especially on Linux machines. The default is 1. |
combatmode |
Integer encoding the parameters to use for ComBat. 1 (default): par.prior = TRUE, mean.only = FALSE; 2: par.prior = TRUE, mean.only = TRUE; 3: par.prior = FALSE, mean.only = FALSE; 4: par.prior = FALSE, mean.only = TRUE. Ignored if method != "ComBat". |
corereduction |
Reducing the number of workers by at least this number |
stopParBatches |
The minimum number of batches required at a hierarchy level to proceed with parallelized adjustment. If the number of batches is smaller, adjustment will be performed sequentially to avoid overheads. |
backend |
The backend to choose for communicating the data. Valid choices are "default" and "file". The latter will use temp files for communicating data chunks between the processes. |
method |
Adjustment method to use. Should either be "ComBat", "limma" or "ref". Also allows "None" for testing purposes, which will perform no batch-effect adjustment. |
qualitycontrol |
Boolean indicating whether average silhouette widths (ASWs) should be computed before and after batch effect adjustment. If TRUE, will compute the ASW with respect to the "Batch" and "Label" columns (if they exist). |
verify |
Whether the input matrix/dataframe needs to be verified before adjustment (faster if FALSE). |
labelname |
A string containing the name of the column to use as class labels. The default is "Label". |
batchname |
A string containing the name of the column to use as batch labels. The default is "Batch". |
referencename |
A string containing the name of the column to use as reference labels. The default is "Reference". |
samplename |
A string containing the name of the column to use as sample name. The default is "Sample". |
covariatename |
A vector containing the names of columns with categorical covariables. The default is NULL, for which all columns with the pattern "Cov" will be selected. |
assayname |
User-defined string that specifies which assay to select if the input data is a SummarizedExperiment. The default is NULL. |
None. Will instead throw an error, if input is not as intended.
Validate the user input to the function generate_dataset. Raises an error if and only if the input is malformed.
validate_input_generate_dataset( features, batches, samplesperbatch, mvstmt, classes, housekeeping, deterministic )
features |
Integer indicating the number of features (e.g. genes/proteins) in the dataset. |
batches |
Integer indicating the number of batches in the dataset. |
samplesperbatch |
Integer indicating the number of samples per batch. |
mvstmt |
Float (in [0,1)) indicating the fraction of missing values per batch. |
classes |
Integer indicating the number of classes in the dataset. |
housekeeping |
If NULL, no housekeeping features will be simulated. Otherwise, housekeeping indicates the fraction of housekeeping features. |
deterministic |
Whether to assign the classes deterministically instead of by random sampling. |
None
Verify that the Reference column of the data contains only zeros and ones (if it is present at all)
verify_references(batch)
batch |
the dataframe for this batch (samples in rows, features in columns) |
either TRUE (everything correct) or FALSE (something is not correct)
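The check described here can be sketched in a few lines of base R. The helper name references_ok is hypothetical, and treating a missing Reference column as valid follows the "if it is present at all" wording above; the package's own function may handle edge cases such as NA entries differently.

```r
# Hypothetical helper (not part of BERT): TRUE if the Reference column is
# absent or contains only zeros and ones, FALSE otherwise.
references_ok <- function(batch) {
  if (!("Reference" %in% names(batch))) return(TRUE)  # column is optional
  all(batch$Reference %in% c(0, 1))
}

references_ok(data.frame(F1 = c(1.2, 0.7), Reference = c(0, 1)))  # TRUE
references_ok(data.frame(F1 = c(1.2, 0.7), Reference = c(0, 2)))  # FALSE
```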