Title: | Predict cis-co-accessibility from single-cell chromatin accessibility data |
---|---|
Description: | Cicero computes putative cis-regulatory maps from single-cell chromatin accessibility data. It also extends monocle 2 for use in chromatin accessibility data. |
Authors: | Hannah Pliner [aut, cre], Cole Trapnell [aut] |
Maintainer: | Hannah Pliner <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.25.0 |
Built: | 2024-12-24 06:18:52 UTC |
Source: | https://github.com/bioc/cicero |
Cicero computes putative cis-regulatory maps from single-cell chromatin accessibility data. It also extends monocle 2 for use in chromatin accessibility data.
Maintainer: Hannah Pliner [email protected]
Authors:
Cole Trapnell [email protected]
Aggregates a CDS based on an indicator column in the pData
table
aggregate_by_cell_bin(cds, group_col)
aggregate_by_cell_bin(cds, group_col)
cds |
A CDS object to be aggregated |
group_col |
The name of the column in the |
This function takes an input CDS object and collapses cells based
on a column in the pData
table by summing the values within the
cell group.
A count cds aggregated by group_col
data("cicero_data") #input_cds <- make_atac_cds(cicero_data, binarize = TRUE) #pData(input_cds)$cell_subtype <- rep(1:10, times=20) #binned_input_lin <-aggregate_by_cell_bin(input_cds, "cell_subtype")
data("cicero_data") #input_cds <- make_atac_cds(cicero_data, binarize = TRUE) #pData(input_cds)$cell_subtype <- rep(1:10, times=20) #binned_input_lin <-aggregate_by_cell_bin(input_cds, "cell_subtype")
Make an aggregate count cds by collapsing nearby peaks
aggregate_nearby_peaks(cds, distance = 1000)
aggregate_nearby_peaks(cds, distance = 1000)
cds |
A CellDataSet (CDS) object. For example, output of
|
distance |
The distance within which peaks should be collapsed |
This function takes an input CDS object and collapses features within a given distance by summing the values for the collapsed features. Ranges of features are determined by their feature name, so the feature names must be in the form "chr1:1039013-2309023".
A CDS object with aggregated peaks.
data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE) agg_cds <- aggregate_nearby_peaks(input_cds, distance = 10000)
data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE) agg_cds <- aggregate_nearby_peaks(input_cds, distance = 10000)
Annotate the sites of your CDS with feature data based on coordinate overlap.
annotate_cds_by_site( cds, feature_data, verbose = FALSE, maxgap = 0, all = FALSE, header = FALSE )
annotate_cds_by_site( cds, feature_data, verbose = FALSE, maxgap = 0, all = FALSE, header = FALSE )
cds |
A CDS object. |
feature_data |
Data frame, or a character path to a file of
feature data. If a path, the file should be tab separated. Default assumes
no header, if your file has a header, set |
verbose |
Logical, should progress messages be printed? |
maxgap |
The maximum number of base pairs allowed between the peak and
the feature for the feature and peak to be considered overlapping.
Default = 0 (overlapping). Details in
|
all |
Logical, should all overlapping intervals be reported? If all is FALSE, the largest overlap is reported. |
header |
Logical, if reading a file, is there a header? |
annotate_cds_by_site
will add columns to the fData
table of a CDS object based on the overlap of peaks with features in a
data frame or file. An "overlap" column will be added, along with any
columns beyond the three required columns in the feature data. The
"overlap" column is the number of base pairs overlapping the fData
site. When maxgap is used, the true overlap is still calculated (overlap
will be 0 if the two features only overlap because of maxgap) NA
means that there was no overlapping feature. If a peak overlaps multiple
data intervals and all is FALSE, the largest overlapping interval will be
chosen (in a tie, the first entry is taken), otherwise all intervals will
be chosen and annotations will be collapsed using a comma as a separator.
A CDS object with updated fData
table.
data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE) feat <- data.frame(chr = c("chr18", "chr18", "chr18", "chr18"), bp1 = c(10000, 10800, 50000, 100000), bp2 = c(10700, 11000, 60000, 110000), type = c("Acetylated", "Methylated", "Acetylated", "Methylated")) input_cds <- annotate_cds_by_site(input_cds, feat)
data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE) feat <- data.frame(chr = c("chr18", "chr18", "chr18", "chr18"), bp1 = c(10000, 10800, 50000, 100000), bp2 = c(10700, 11000, 60000, 110000), type = c("Acetylated", "Methylated", "Acetylated", "Methylated")) input_cds <- annotate_cds_by_site(input_cds, feat)
Function which takes the output of generate_cicero_models
and
assembles the connections into a data frame with cicero co-accessibility
scores.
assemble_connections(cicero_model_list, silent = FALSE)
assemble_connections(cicero_model_list, silent = FALSE)
cicero_model_list |
A list of cicero output objects, generally, the
output of |
silent |
Logical, should the function run silently? |
This function combines glasso models computed on overlapping windows of the genome. Pairs of sites whose regularized correlation was calculated twice are first checked for qualitative concordance (both zero, positive or negative). If they not concordant, NA is returned. If they are concordant the mean is returned.
A data frame of connections with their cicero co-accessibility scores.
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) model_output <- generate_cicero_models(cicero_cds, distance_parameter = 0.3, genomic_coords = sample_genome) cicero_cons <- assemble_connections(model_output)
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) model_output <- generate_cicero_models(cicero_cds, distance_parameter = 0.3, genomic_coords = sample_genome) cicero_cons <- assemble_connections(model_output)
This function calculates the initial Cicero gene activity matrix. After this
function, the activity matrix should be normalized with any comparison
matrices using the function normalize_gene_activities
.
build_gene_activity_matrix( input_cds, cicero_cons_info, site_weights = NULL, dist_thresh = 250000, coaccess_cutoff = 0.25 )
build_gene_activity_matrix( input_cds, cicero_cons_info, site_weights = NULL, dist_thresh = 250000, coaccess_cutoff = 0.25 )
input_cds |
Binary sci-ATAC-seq input CDS. The input CDS must have a
column in the fData table called "gene" which is the gene name if the
site is a promoter, and |
cicero_cons_info |
Cicero connections table, generally the output of
|
site_weights |
NULL or an individual weight for each site in input_cds. |
dist_thresh |
The maximum distance in base pairs between pairs of sites to include in the gene activity calculation. |
coaccess_cutoff |
The minimum Cicero co-accessibility score that should be considered connected. |
Unnormalized gene activity matrix.
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- detectGenes(input_cds) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cons <- run_cicero(cicero_cds, sample_genome, sample_num=2) data(gene_annotation_sample) gene_annotation_sub <- gene_annotation_sample[,c(1:3, 8)] names(gene_annotation_sub)[4] <- "gene" input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub) num_genes <- pData(input_cds)$num_genes_expressed names(num_genes) <- row.names(pData(input_cds)) unnorm_ga <- build_gene_activity_matrix(input_cds, cons)
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- detectGenes(input_cds) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cons <- run_cicero(cicero_cds, sample_genome, sample_num=2) data(gene_annotation_sample) gene_annotation_sub <- gene_annotation_sample[,c(1:3, 8)] names(gene_annotation_sub)[4] <- "gene" input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub) num_genes <- pData(input_cds)$num_genes_expressed names(num_genes) <- row.names(pData(input_cds)) unnorm_ga <- build_gene_activity_matrix(input_cds, cons)
Metadata information for cicero_data
cell_data
cell_data
A data frame with 200 rows and 2 variables:
Time at cell collection
Cell barcode
A dataset containing a subset of a single-cell ATAC-seq dataset collected on Human Skeletal Muscle Myoblasts. Only includes data from chromosome 18.
cicero_data
cicero_data
A data frame with 35137 rows and 3 variables:
Peak information
Cell ID
Reads per cell per peak
Compare two sets of connections and return a vector of logicals for whether connections in one are present in the other.
compare_connections(conns1, conns2, maxgap = 0)
compare_connections(conns1, conns2, maxgap = 0)
conns1 |
A data frame of Cicero connections, like those output from
|
conns2 |
A data frame of connections to be searched for overlap. The first two columns must be coordinates of genomic sites that are connected. |
maxgap |
The number of base pairs between peaks allowed to be called
overlapping. See |
A vector of logicals of whether the Cicero pair is present in the alternate dataset.
## Not run: cons$in_dataset <- compare_connections(conns, alt_data) ## End(Not run)
## Not run: cons$in_dataset <- compare_connections(conns, alt_data) ## End(Not run)
Construct a data frame of coordinate info from coordinate strings
df_for_coords(coord_strings)
df_for_coords(coord_strings)
coord_strings |
A list of coordinate strings (each like "chr1:500000-1000000") |
Coordinate strings consist of three pieces of information: chromosome, start, and stop. These pieces of information can be separated by the characters ":", "_", or "-". Commas will be removed, not used as separators (ex: "chr18:8,575,097-8,839,855" is ok).
data.frame with three columns, chromosome, starting base pair and ending base pair
df_for_coords(c("chr1:2,039-30,239", "chrX:28884:101293"))
df_for_coords(c("chr1:2,039-30,239", "chrX:28884:101293"))
Function to calculate distance penalty parameter (distance_parameter
)
for random genomic windows. Used to choose distance_parameter
to pass
to generate_cicero_models
.
estimate_distance_parameter( cds, window = 5e+05, maxit = 100, s = 0.75, sample_num = 100, distance_constraint = 250000, distance_parameter_convergence = 1e-22, max_elements = 200, genomic_coords = cicero::human.hg19.genome, max_sample_windows = 500 )
estimate_distance_parameter( cds, window = 5e+05, maxit = 100, s = 0.75, sample_num = 100, distance_constraint = 250000, distance_parameter_convergence = 1e-22, max_elements = 200, genomic_coords = cicero::human.hg19.genome, max_sample_windows = 500 )
cds |
A cicero CDS object generated using |
window |
Size of the genomic window to query, in base pairs. |
maxit |
Maximum number of iterations for distance_parameter estimation. |
s |
Power law value. See details for more information. |
sample_num |
Number of random windows to calculate
|
distance_constraint |
Maximum distance of expected connections. Must be
smaller than |
distance_parameter_convergence |
Convergence step size for
|
max_elements |
Maximum number of elements per window allowed. Prevents very large models from slowing performance. |
genomic_coords |
Either a data frame or a path (character) to a file
with chromosome lengths. The file should have two columns, the first is
the chromosome name (ex. "chr1") and the second is the chromosome length
in base pairs. See |
max_sample_windows |
Maximum number of random windows to screen to find sample_num windows for distance calculation. Default 500. |
The purpose of this function is to calculate the distance scaling
parameter used to adjust the distance-based penalty function used in
Cicero's model calculation. The scaling parameter, in combination with the
power law value s
determines the distance-based penalty.
This function chooses random windows of the genome and calculates a
distance_parameter
. The function returns a vector of values
calculated on these random windows. We recommend using the mean value of
this vector moving forward with Cicero analysis.
The function works by finding the minimum distance scaling parameter such
that no more than 5
distance_constraint
have non-zero entries after graphical lasso
regularization and such that fewer than 80
nonzero.
If the chosen random window has fewer than 2 or greater than
max_elements
sites, the window is skipped. In addition, the random
window will be skipped if there are insufficient long-range comparisons
(see below) to be made. The max_elements
parameter exist to prevent
very dense windows from slowing the calculation. If you expect that your
data may regularly have this many sites in a window, you will need to
raise this parameter.
Calculating the distance_parameter
in a sample window requires
peaks in that window that are at a distance greater than the
distance_constraint
parameter. If there are not enough examples at
high distance have been found, the function will return the warning
"Warning: could not calculate sample_num distance_parameters - see
documentation details"
.When looking for sample_num
example
windows, the function will search max_sample_windows
windows. By
default this is set at 500, which should be well beyond the 100 windows
that need to be found. However, in very sparse datasets, increasing
max_sample_windows
may help avoid the above warning. Increasing
max_sample_windows
may slow performance in sparse datasets. If you
are still not able to get enough example windows, even with a large
max_sample_windows
paramter, this may mean your window
parameter needs to be larger or your distance_constraint
parameter
needs to be smaller. A less likely possibility is that your
max_elements
parameter needs to be larger. This would occur if your
data is particularly dense.
The parameter s
is a constant that captures the power-law
distribution of contact frequencies between different locations in the
genome as a function of their linear distance. For a complete discussion
of the various polymer models of DNA packed into the nucleus and of
justifiable values for s, we refer readers to (Dekker et al., 2013) for a
discussion of justifiable values for s. We use a value of 0.75 by default
in Cicero, which corresponds to the “tension globule” polymer model of DNA
(Sanborn et al., 2015). This parameter must be the same as the s parameter
for generate_cicero_models.
Further details are available in the publication that accompanies this
package. Run citation("cicero")
for publication details.
A list of results of length sample_num
. List members are
numeric distance_parameter
values.
Dekker, J., Marti-Renom, M.A., and Mirny, L.A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403.
Sanborn, A.L., Rao, S.S.P., Huang, S.-C., Durand, N.C., Huntley, M.H., Jewett, A.I., Bochkov, I.D., Chinnappan, D., Cutkosky, A., Li, J., et al. (2015). Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl. Acad. Sci. U. S. A. 112, E6456–E6465.
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) distance_parameters <- estimate_distance_parameter(cicero_cds, sample_num=5, genomic_coords = sample_genome)
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) distance_parameters <- estimate_distance_parameter(cicero_cds, sample_num=5, genomic_coords = sample_genome)
Find CCANs that overlap each other in genomic coordinates
find_overlapping_ccans(ccan_assignments, min_overlap = 1)
find_overlapping_ccans(ccan_assignments, min_overlap = 1)
ccan_assignments |
A data frame where the first column is the peak and
the second is the CCAN assignment. For example, output of
|
min_overlap |
The minimum base pair overlap to count as overlapping. |
A data frame with two columns, CCAN1 and CCAN2. CCANs in this list are overlapping. The data frame is reciprocal (if CCAN 2 overlaps CCAN 1, there will be two rows, 1,2 and 2,1).
ccan_df <- data.frame(peak = c("chr18_1408345_1408845", "chr18_1779830_1780330", "chr18_1929095_1929595", "chr18_1954501_1954727", "chr18_2049865_2050884", "chr18_2083726_2084102", "chr18_2087935_2088622", "chr18_2104705_2105551", "chr18_2108641_2108907"), CCAN = c(1,2,2,2,3,3,3,3,2)) olap_ccans <- find_overlapping_ccans(ccan_df)
ccan_df <- data.frame(peak = c("chr18_1408345_1408845", "chr18_1779830_1780330", "chr18_1929095_1929595", "chr18_1954501_1954727", "chr18_2049865_2050884", "chr18_2083726_2084102", "chr18_2087935_2088622", "chr18_2104705_2105551", "chr18_2108641_2108907"), CCAN = c(1,2,2,2,3,3,3,3,2)) olap_ccans <- find_overlapping_ccans(ccan_df)
Find peaks that overlap a specific genomic location
find_overlapping_coordinates(coord_list, coord, maxgap = 0)
find_overlapping_coordinates(coord_list, coord, maxgap = 0)
coord_list |
A list of coordinates to be searched for overlap in the form chr_100_2000. |
coord |
The coordinates that you want to find in the form chr1_100_2000. |
maxgap |
The maximum distance in base pairs between coord and the coord_list that should count as overlapping. Default is 0. |
A character vector of the peaks that overlap coord.
test_coords <- c("chr18_10025_10225", "chr18_10603_11103", "chr18_11604_13986", "chr18_157883_158536", "chr18_217477_218555", "chr18_245734_246234") find_overlapping_coordinates(test_coords, "chr18:10,100-1246234")
test_coords <- c("chr18_10025_10225", "chr18_10603_11103", "chr18_11604_13986", "chr18_157883_158536", "chr18_217477_218555", "chr18_245734_246234") find_overlapping_coordinates(test_coords, "chr18:10,100-1246234")
Gencode gene annotation data from chromosome 18 of the human genome (hg19).
gene_annotation_sample
gene_annotation_sample
A data frame with 15129 rows and 8 variables:
Chromosome
Exon starting base
Exon ending base
Exon mapping direction
Feature type
Gene ID
Transcript ID
Gene symbol
Post process cicero co-accessibility scores to extract modules of sites that are co-accessible.
generate_ccans( connections_df, coaccess_cutoff_override = NULL, tolerance_digits = 2 )
generate_ccans( connections_df, coaccess_cutoff_override = NULL, tolerance_digits = 2 )
connections_df |
Data frame of connections with columns: Peak1, Peak2,
coaccess. Generally, the output of |
coaccess_cutoff_override |
Numeric, co-accessibility score threshold to impose. Overrides automatic calculation. |
tolerance_digits |
The number of digits to calculate cutoff to. Default is 2 (0.01 tolerance) |
CCANs are calculated by first specifying a minimum co-accessibility
score and then using the Louvain community detection algorithm on the
subgraph induced by excluding edges below this score. For this function,
either the user can specify the minimum co-accessibility using
coaccess_cutoff_override
, or the cutoff can be calculated
automatically by optimizing for CCAN number. The cutoff calculation can be
slow, so users may wish to use the coaccess_cutoff_override
after
initially calculating the cutoff to speed future runs.
Data frame with two columns - Peak and CCAN. CCAN column indicates CCAN assignment. Peaks not included in a CCAN are not returned.
## Not run: data("cicero_data") set.seed(18) data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cicero_cons <- run_cicero(cicero_cds, sample_genome, sample_num = 2) ccan_assigns <- generate_ccans(cicero_cons) ## End(Not run)
## Not run: data("cicero_data") set.seed(18) data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cicero_cons <- run_cicero(cicero_cds, sample_genome, sample_num = 2) ccan_assigns <- generate_ccans(cicero_cons) ## End(Not run)
Function to generate graphical lasso models on all sites in a CDS object within overlapping genomic windows.
generate_cicero_models( cds, distance_parameter, s = 0.75, window = 5e+05, max_elements = 200, genomic_coords = cicero::human.hg19.genome )
generate_cicero_models( cds, distance_parameter, s = 0.75, window = 5e+05, max_elements = 200, genomic_coords = cicero::human.hg19.genome )
cds |
A cicero CDS object generated using |
distance_parameter |
Distance based penalty parameter value. Generally,
the mean of the calculated |
s |
Power law value. See details. |
window |
Size of the genomic window to query, in base pairs. |
max_elements |
Maximum number of elements per window allowed. Prevents very large models from slowing performance. |
genomic_coords |
Either a data frame or a path (character) to a file
with chromosome lengths. The file should have two columns, the first is
the chromosome name (ex. "chr1") and the second is the chromosome length
in base pairs. See |
The purpose of this function is to compute the raw covariances
between each pair of sites within overlapping windows of the genome.
Within each window, the function then estimates a regularized correlation
matrix using the graphical LASSO (Friedman et al., 2008), penalizing pairs
of distant sites more than proximal sites. The scaling parameter,
distance_parameter
, in combination with the power law value s
determines the distance-based penalty.
The parameter s
is a constant that captures the power-law
distribution of contact frequencies between different locations in the
genome as a function of their linear distance. For a complete discussion
of the various polymer models of DNA packed into the nucleus and of
justifiable values for s, we refer readers to (Dekker et al., 2013) for a
discussion of justifiable values for s. We use a value of 0.75 by default
in Cicero, which corresponds to the “tension globule” polymer model of DNA
(Sanborn et al., 2015). This parameter must be the same as the s parameter
for estimate_distance_parameter
.
Further details are available in the publication that accompanies this
package. Run citation("cicero")
for publication details.
A list of results for each window. Either a glasso
object, or
a character description of why the window was skipped. This list can be
directly input into assemble_connections
to create a
reconciled list of cicero co-accessibility scores.
Dekker, J., Marti-Renom, M.A., and Mirny, L.A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403.
Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Sanborn, A.L., Rao, S.S.P., Huang, S.-C., Durand, N.C., Huntley, M.H., Jewett, A.I., Bochkov, I.D., Chinnappan, D., Cutkosky, A., Li, J., et al. (2015). Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proc. Natl. Acad. Sci. U. S. A. 112, E6456–E6465.
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) model_output <- generate_cicero_models(cicero_cds, distance_parameter = 0.3, genomic_coords = sample_genome)
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) model_output <- generate_cicero_models(cicero_cds, distance_parameter = 0.3, genomic_coords = sample_genome)
A list of the chromosomes in hg19 and their lengths in base pairs.
human.hg19.genome
human.hg19.genome
A data frame with 93 rows and 2 variables:
Chromosome
Chromosome length, base pairs
This function takes as input a data frame or a path to a file in a sparse
matrix format and returns a properly formatted CellDataSet
(CDS)
object.
make_atac_cds(input, binarize = FALSE)
make_atac_cds(input, binarize = FALSE)
input |
Either a data frame or a path to input data. If a file, it should be a tab-delimited text file with three columns and no header. For either a file or a data frame, the first column is the peak coordinates in the form "chr10_100013372_100013596", the second column is the cell name, and the third column is an integer that represents the number of reads from that cell overlapping that peak. Zero values do not need to be included (sparse matrix format). |
binarize |
Logical. Should the count matrix be converted to binary? |
A CDS object containing your ATAC data in proper format.
data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE)
data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE)
Function to generate an aggregated input CDS for cicero. run_cicero
takes as input an aggregated cicero CDS object. This function will generate
the CDS given an input CDS (perhaps generated by make_atac_cds
) and
a value for k, which is the number of cells to be aggregated per bin. The
default value for k is 50.
make_cicero_cds( cds, reduced_coordinates, k = 50, summary_stats = NULL, size_factor_normalize = TRUE, silent = FALSE, return_agg_info = FALSE )
make_cicero_cds( cds, reduced_coordinates, k = 50, summary_stats = NULL, size_factor_normalize = TRUE, silent = FALSE, return_agg_info = FALSE )
cds |
Input CDS object. |
reduced_coordinates |
A data frame with columns representing the
coordinates of each cell in reduced dimension space (generally 2-3
dimensions). |
k |
Number of cells to aggregate per bin. |
summary_stats |
Which numeric |
size_factor_normalize |
Logical, should accessibility values be normalized by size factor? |
silent |
Logical, should warning and info messages be printed? |
return_agg_info |
Logical, should a list of the assignments of cells to
aggregated bins be output? When |
Aggregation of similar cells is done using a k-nearest-neighbors
graph and a randomized "bagging" procedure. Details are available in the
publication that accompanies this package. Run citation("cicero")
for publication details. KNN is calculated using
knn.index
Aggregated CDS object. If return_agg_info is TRUE
, a list
of the aggregated CDS object and a data.frame of aggregation info.
## Not run: data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) ## End(Not run)
## Not run: data("cicero_data") input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) ## End(Not run)
Convert a data frame into a square sparse matrix (all versus all)
make_sparse_matrix(data, i.name = "Peak1", j.name = "Peak2", x.name = "value")
make_sparse_matrix(data, i.name = "Peak1", j.name = "Peak2", x.name = "value")
data |
data frame |
i.name |
name of i column |
j.name |
name of j column |
x.name |
name of value column |
sparse matrix
Normalize the output of build_gene_activity_matrix
. Input is
either one or multiple gene activity matrices. Any gene activities to be
compared amongst each other should be normalized together.
normalize_gene_activities(activity_matrices, cell_num_genes)
normalize_gene_activities(activity_matrices, cell_num_genes)
activity_matrices |
A gene activity matrix, output from
|
cell_num_genes |
A named vector of the total number of accessible sites per cell. Names should correspond to the cell names in the activity matrices. These values can be found in the "num_genes_expressed" column of the pData table of the CDS used to calculate the gene activity matrix. |
Normalized activity matrix or matrices.
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- detectGenes(input_cds) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cons <- run_cicero(cicero_cds, sample_genome, sample_num=2) data(gene_annotation_sample) gene_annotation_sub <- gene_annotation_sample[,c(1:3, 8)] names(gene_annotation_sub)[4] <- "gene" input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub) num_genes <- pData(input_cds)$num_genes_expressed names(num_genes) <- row.names(pData(input_cds)) unnorm_ga <- build_gene_activity_matrix(input_cds, cons) cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- detectGenes(input_cds) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cons <- run_cicero(cicero_cds, sample_genome, sample_num=2) data(gene_annotation_sample) gene_annotation_sub <- gene_annotation_sample[,c(1:3, 8)] names(gene_annotation_sub)[4] <- "gene" input_cds <- annotate_cds_by_site(input_cds, gene_annotation_sub) num_genes <- pData(input_cds)$num_genes_expressed names(num_genes) <- row.names(pData(input_cds)) unnorm_ga <- build_gene_activity_matrix(input_cds, cons) cicero_gene_activities <- normalize_gene_activities(unnorm_ga, num_genes)
Make a barplot of chromatin accessibility across pseudotime
plot_accessibility_in_pseudotime(cds_subset, breaks = 10)
plot_accessibility_in_pseudotime(cds_subset, breaks = 10)
cds_subset |
Subset of the CDS object you want to plot. The CDS must have a column in the pData table called "Pseudotime". |
breaks |
Number of breaks along pseudotime. Controls the coarseness of the plot. |
This function plots each site in the CDS subset by cell pseudotime
as a barplot. Cells are divided into bins by pseudotime (number determined
by breaks
) and the percent of cells in each bin that are accessible
is represented by bar height. In addition, the black line represents the
pseudotime-dependent average accessibility from a smoothed binomial
regression.
ggplot object
## Not run: plot_accessibility_in_pseudotime(input_cds_lin[c("chr18_38156577_38158261", "chr18_48373358_48374180", "chr18_60457956_60459080")]) ## End(Not run)
## Not run: plot_accessibility_in_pseudotime(input_cds_lin[c("chr18_38156577_38158261", "chr18_48373358_48374180", "chr18_60457956_60459080")]) ## End(Not run)
Plotting function for Cicero connections. Uses plotTracks
as its basis
plot_connections( connection_df, chr, minbp, maxbp, coaccess_cutoff = 0, peak_color = "#B4656F", connection_color = "#7F7CAF", connection_color_legend = TRUE, alpha_by_coaccess = FALSE, connection_width = 2, connection_ymax = NULL, gene_model = NULL, gene_model_color = "#81D2C7", gene_model_shape = c("smallArrow", "box"), comparison_track = NULL, comparison_coaccess_cutoff = 0, comparison_peak_color = "#B4656F", comparison_connection_color = "#7F7CAF", comparison_connection_color_legend = TRUE, comparison_connection_width = 2, comparison_ymax = NULL, collapseTranscripts = FALSE, include_axis_track = TRUE, return_as_list = FALSE, viewpoint = NULL, comparison_viewpoint = TRUE, viewpoint_color = "#F0544F", viewpoint_fill = "#EFD8D7", viewpoint_alpha = 0.5 )
plot_connections( connection_df, chr, minbp, maxbp, coaccess_cutoff = 0, peak_color = "#B4656F", connection_color = "#7F7CAF", connection_color_legend = TRUE, alpha_by_coaccess = FALSE, connection_width = 2, connection_ymax = NULL, gene_model = NULL, gene_model_color = "#81D2C7", gene_model_shape = c("smallArrow", "box"), comparison_track = NULL, comparison_coaccess_cutoff = 0, comparison_peak_color = "#B4656F", comparison_connection_color = "#7F7CAF", comparison_connection_color_legend = TRUE, comparison_connection_width = 2, comparison_ymax = NULL, collapseTranscripts = FALSE, include_axis_track = TRUE, return_as_list = FALSE, viewpoint = NULL, comparison_viewpoint = TRUE, viewpoint_color = "#F0544F", viewpoint_fill = "#EFD8D7", viewpoint_alpha = 0.5 )
connection_df |
Data frame of connections, which must include the columns 'Peak1', 'Peak2', and 'coaccess'. Generally, the output of run_cicero or assemble_connections. |
chr |
The chromosome of the region you would like to plot in the form 'chr10'. |
minbp |
The base pair coordinate of the start of the region to be plotted. |
maxbp |
The base pair coordinate of the end of the region to be plotted. |
coaccess_cutoff |
The minimum cicero co-accessibility score you would like to be plotted. Default is 0. |
peak_color |
Color for peak annotations - a single color, the name of a column containing color values that correspond to Peak1, or the name of column containing a character or factor to base peak colors on. |
connection_color |
Color for connection lines. A single color, the name of a column containing color values, or the name of a column containing a character or factor to base connection colors on. |
connection_color_legend |
Logical, should connection color legend be shown? |
alpha_by_coaccess |
Logical, should the transparency of connection lines be scaled based on co-accessibility score? |
connection_width |
Width of connection lines. |
connection_ymax |
Connection y-axis height. If |
gene_model |
Either |
gene_model_color |
Color for gene annotations. |
gene_model_shape |
Shape for gene models, passed to
|
comparison_track |
Either |
comparison_coaccess_cutoff |
The minimum cicero co-accessibility score you would like to be plotted for the comparison dataset. Default = 0. |
comparison_peak_color |
Color for comparison peak annotations - a single color, the name of a column containing color values that correspond to Peak1, or the name of a column containing a character or factor to base peak colors on. |
comparison_connection_color |
Color for comparison connection lines. A single color, the name of a column containing color values, or the name of a column containing a character or factor to base connection colors on. |
comparison_connection_color_legend |
Logical, should comparison connection color legend be shown? |
comparison_connection_width |
Width of comparison connection lines. |
comparison_ymax |
Connection y-axis height for comparison track. If
|
collapseTranscripts |
Logical or character scalar. Can be one in
|
include_axis_track |
Logical, should a genomic axis be plotted? |
return_as_list |
Logical, if TRUE, the function will not plot, but will
return the plot components as a list. Allows user to add/customize Gviz
components and plot them separately using |
viewpoint |
|
comparison_viewpoint |
Logical, should viewpoint apply to comparison track as well? |
viewpoint_color |
Color for the highlight border. |
viewpoint_fill |
Color for the highlight fill. |
viewpoint_alpha |
Alpha value for the highlight fill. |
A gene region plot, or list of components if return_as_list is
TRUE
.
cicero_cons <- data.frame( Peak1 = c("chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10087586_10087901", "chr18_10120685_10127115", "chr18_10097718_10097934", "chr18_10087586_10087901", "chr18_10154818_10155215", "chr18_10238762_10238983", "chr18_10198959_10199183", "chr18_10250985_10251585"), Peak2 = c("chr18_10097718_10097934", "chr18_10087586_10087901", "chr18_10154818_10155215", "chr18_10238762_10238983", "chr18_10198959_10199183", "chr18_10250985_10251585", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10087586_10087901", "chr18_10120685_10127115"), coaccess = c(0.0051121787, 0.0016698617, 0.0006570246, 0.0013466927, 0.0737935011, 0.3264019452, 0.0051121787, 0.0016698617, 0.0006570246, 0.0013466927, 0.0737935011, 0.3264019452)) plot_connections(cicero_cons, chr = "chr18", minbp = 10034652, maxbp = 10251585, peak_color = "purple")
cicero_cons <- data.frame( Peak1 = c("chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10087586_10087901", "chr18_10120685_10127115", "chr18_10097718_10097934", "chr18_10087586_10087901", "chr18_10154818_10155215", "chr18_10238762_10238983", "chr18_10198959_10199183", "chr18_10250985_10251585"), Peak2 = c("chr18_10097718_10097934", "chr18_10087586_10087901", "chr18_10154818_10155215", "chr18_10238762_10238983", "chr18_10198959_10199183", "chr18_10250985_10251585", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10034652_10034983", "chr18_10087586_10087901", "chr18_10120685_10127115"), coaccess = c(0.0051121787, 0.0016698617, 0.0006570246, 0.0013466927, 0.0737935011, 0.3264019452, 0.0051121787, 0.0016698617, 0.0006570246, 0.0013466927, 0.0737935011, 0.3264019452)) plot_connections(cicero_cons, chr = "chr18", minbp = 10034652, maxbp = 10251585, peak_color = "purple")
Construct GRanges objects from coordinate strings
ranges_for_coords(coord_strings, meta_data_df = NULL, with_names = FALSE)
ranges_for_coords(coord_strings, meta_data_df = NULL, with_names = FALSE)
coord_strings |
A list of coordinate strings (in the form "chr1:500000-1000000") |
meta_data_df |
A data frame with any meta data columns you want included with the ranges. Must be in the same order as coord_strings. |
with_names |
logical - should meta data include coordinate string (field coord_string)? |
Coordinate strings consist of three pieces of information: chromosome, start, and stop. These pieces of information can be separated by the characters ":", "_", or "-". Commas will be removed, not used as separators (ex: "chr18:8,575,097-8,839,855" is ok).
GRanges object of the input strings
ran1 <- ranges_for_coords("chr1:2039-30239", with_names = TRUE) ran2 <- ranges_for_coords(c("chr1:2049-203902", "chrX:489249-1389389"), meta_data_df = data.frame(dat = c("1", "X"))) ran3 <- ranges_for_coords(c("chr1:2049-203902", "chrX:489249-1389389"), with_names = TRUE, meta_data_df = data.frame(dat = c("1", "X"), stringsAsFactors = FALSE))
ran1 <- ranges_for_coords("chr1:2039-30239", with_names = TRUE) ran2 <- ranges_for_coords(c("chr1:2049-203902", "chrX:489249-1389389"), meta_data_df = data.frame(dat = c("1", "X"))) ran3 <- ranges_for_coords(c("chr1:2049-203902", "chrX:489249-1389389"), with_names = TRUE, meta_data_df = data.frame(dat = c("1", "X"), stringsAsFactors = FALSE))
A wrapper function that runs the primary functions of the Cicero pipeline
with default parameters. Runs estimate_distance_parameter
,
generate_cicero_models
and assemble_connections
.
See the manual pages of these functions for details about their function and
parameter options. Defaults in this function are designed for mammalian data,
those with non-mammalian data should read about parameters in the above
functions.
run_cicero( cds, genomic_coords, window = 5e+05, silent = FALSE, sample_num = 100 )
run_cicero( cds, genomic_coords, window = 5e+05, silent = FALSE, sample_num = 100 )
cds |
Cicero CDS object, created using |
genomic_coords |
Either a data frame or a path (character) to a file
with chromosome lengths. The file should have two columns, the first is
the chromosome name (ex. "chr1") and the second is the chromosome length
in base pairs. See |
window |
Size of the genomic window to query, in base pairs. |
silent |
Whether to print progress messages |
sample_num |
How many sample genomic windows to use to generate
|
A table of co-accessibility scores
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cons <- run_cicero(cicero_cds, sample_genome, sample_num = 2)
data("cicero_data") data("human.hg19.genome") sample_genome <- subset(human.hg19.genome, V1 == "chr18") sample_genome$V2[1] <- 100000 input_cds <- make_atac_cds(cicero_data, binarize = TRUE) input_cds <- reduceDimension(input_cds, max_components = 2, num_dim=6, reduction_method = 'tSNE', norm_method = "none") tsne_coords <- t(reducedDimA(input_cds)) row.names(tsne_coords) <- row.names(pData(input_cds)) cicero_cds <- make_cicero_cds(input_cds, reduced_coordinates = tsne_coords) cons <- run_cicero(cicero_cds, sample_genome, sample_num = 2)