Damsel-workflow

1. Introduction

This document gives an introduction to the R package Damsel, for use in DamID analysis; from BAM file input to gene ontology analysis.

Designed for use with DamID data, the Damsel methodology could be modified for use on any similar technology that seeks to identify enriched regions relative to a control sample.

Utilising the power of edgeR for differential analysis and goseq for gene ontology bias correction, Damsel provides a unique end to end analysis for DamID.

The DamID example data used in this vignette is available in the package and has been taken from Vissers et al., (2018), ‘The Scalloped and Nerfin-1 Transcription Factors Cooperate to Maintain Neuronal Cell Fate’. The fastq files were downloaded from SRA, aligned using Rsubread::index and Rsubread::align, before sorting and making bai files with Samtools.

Installation

if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("Damsel")
library(Damsel)

Processing the BAM files

As a DamID analysis tool, Damsel requires a GATC region file for analysis. These regions serve as a guide to extract counts from the BAM files.

Introducing the GATC region file

It can be generated with getGatcRegions() using a BSGenome object or a FASTA file.

It is a GRangesList with the consecutive GATC regions across the genome - representing the region (or the length) between GATC sites, as well as the positions of the sites themselves.

If you have another species of DamID data or would prefer to make your own region file, you can use the following function, providing a BSgenome object or a FASTA file.

library(BSgenome.Dmelanogaster.UCSC.dm6)
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, saveRDS, setdiff, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#> 
#>     findMatches
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#> 
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#> 
#>     strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#> 
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#> 
#>     FileForFormat
regions_and_sites <- getGatcRegions(BSgenome.Dmelanogaster.UCSC.dm6)
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in GenomeInfoDb::renameSeqlevels(x = df_, value = newStyle): invalid
#> seqlevels 'chrM' ignored
regions <- regions_and_sites$regions
knitr::kable(head(regions))
seqnames start end width strand Position
2L 82 230 149 * chr2L-82
2L 231 371 141 * chr2L-231
2L 372 539 168 * chr2L-372
2L 540 688 149 * chr2L-540
2L 689 829 141 * chr2L-689
2L 830 997 168 * chr2L-830
knitr::kable(head(regions_and_sites$sites))
seqnames start end width strand Position
2L 80 83 4 * chr2L-80
2L 229 232 4 * chr2L-229
2L 370 373 4 * chr2L-370
2L 538 541 4 * chr2L-538
2L 687 690 4 * chr2L-687
2L 828 831 4 * chr2L-828

If you already have your own GATC region file, ensure that it has the same format with 6 columns:

  • Position: chromosome-start
  • seqnames: chromosome name
  • start: start of region
  • end: end of region
  • width: length of region (ensure that is is correct according to [plyranges::as_granges()])

Extracting the counts within the GATC regions

Note: Damsel requires BAM files that have been mapped to the reference genome.

Provided the path to a folder of BAM files (and their .bai files) and the appropriate GATC region file, the function countBamInGATC() will extract the counts for each region for each available BAM and add them as columns to a data frame. The columns will be named by the BAM file name - please rename them before running the function if they do not make sense.

path_to_bams <- system.file("extdata", package = "Damsel")
counts.df <- countBamInGATC(path_to_bams,
    regions = regions
)
  • If necessary, at this stage please rearrange the BAM file columns so they are ordered in the following way: Dam_1, Fusion_1, Dam_2, Fusion_2 etc
knitr::kable(head(counts.df))
X Position seqnames start end width strand dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM sd_2_SRR7948877.BAM
1 chr2L-82 chr2L 82 230 149 * 1.0 0.33 0.0 0.0
2 chr2L-231 chr2L 231 371 141 * 1.5 5.67 87.0 57.5
3 chr2L-372 chr2L 372 539 168 * 2.5 6.17 88.0 58.5
4 chr2L-540 chr2L 540 688 149 * 2.0 4.83 0.0 0.0
5 chr2L-689 chr2L 689 829 141 * 0.0 0.00 0.5 0.5
6 chr2L-830 chr2L 830 997 168 * 0.0 1.33 4.5 3.5

This example data is also directly available as a counts file via data.

data("dros_counts")
  • Do not remove the .bam extension on the column names as this is used as a check in later functions to ensure only the BAM files are selected from the data frame.

  • The DamID data captures the ~75bp region extending from each GATC site, so although regions are of differing widths, there is a null to minimal length bias present on the data, and does not require length correction.

Correlation analysis of samples

At this stage, the similarities and differences between the samples can be analysed via correlation. plotCorrHeatmap plots the correlation of all available BAM files Dam and Fusion, to visualise the similarity between files. The default for all Damsel correlation analysis is the non-parametric “spearman’s” correlation. The correlation between Dam_1 and Fusion_1 can be expected to reach ~ 0.7, whereas the correlation between Dam_1 & Dam_3 or Fusion_1 & Fusion_2 would be expected to be closer to ~0.9

plotCorrHeatmap(df = counts.df, method = "spearman")

Two specific samples can also be compared using ggscatter which plots a scatterplot of the two samples, overlaid with the correlation results. [ggpubr::ggscatter()]

Visualisation of coverage

The overall coverage of different samples can be compared

plotCountsDistribution(counts.df, constant = 1)

A specific region can be selected to view the counts across samples.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    layout = "spread"
)

As shown in the following plots, the default layout is "stacked", where the replicates are overlaid. The counts can also be displayed in a log2 ratio with log2_scale=TRUE

Differential methylation analysis

The goal with DamID analysis is to identify regions that are enriched in the fusion sample relative to the control. In Damsel, this step is referred to as differential methylation analysis, and makes use of [edgeR].

For ease of use, Damsel has four main edgeR based functions which compile different steps and functions from within edgeR.

Setting up edgeR analysis

makeDGE sets up the edgeR analysis for differential methylation testing. Taking the data frame of samples and regions as input, it conducts the following steps:

  • it extracts the sample data
  • groups the samples (Dam or Fusion)
  • filters the samples (remove regions with very low counts, the filtering parameters may be adjusted)
  • normalises the data
  • establishes the design matrix (this includes the sample group and pairing replicates together - Dam_1 & Fusion_1)
  • estimates the dispersion
dge <- makeDGE(counts.df)
head(dge)
#> An object of class "DGEList"
#> $counts
#>            dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM
#> chr2L-231                  1.50                5.67                 87.0
#> chr2L-372                  2.50                6.17                 88.0
#> chr2L-3578                 4.00                1.50                  6.0
#> chr2L-4494                 1.42                7.83                  8.0
#> chr2L-4952                 3.58                1.83                  5.5
#> chr2L-5120                 4.42                3.00                  6.0
#>            sd_2_SRR7948877.BAM
#> chr2L-231                57.50
#> chr2L-372                58.50
#> chr2L-3578                5.00
#> chr2L-4494                6.00
#> chr2L-4952                5.33
#> chr2L-5120                5.83
#> 
#> $samples
#>                       group lib.size norm.factors
#> dam_1_SRR7948872.BAM    Dam  2238419    1.7342299
#> sd_1_SRR7948874.BAM  Fusion  2027147    0.8980197
#> dam_2_SRR7948876.BAM    Dam 10752265    1.2158660
#> sd_2_SRR7948877.BAM  Fusion 11540604    0.5281068
#> 
#> $genes
#>              Position seqnames start  end
#> chr2L-231   chr2L-231    chr2L   231  371
#> chr2L-372   chr2L-372    chr2L   372  539
#> chr2L-3578 chr2L-3578    chr2L  3578 3745
#> chr2L-4494 chr2L-4494    chr2L  4494 4661
#> chr2L-4952 chr2L-4952    chr2L  4952 5119
#> chr2L-5120 chr2L-5120    chr2L  5120 5153
#> 
#> $design
#>   (Intercept) group V2
#> 1           1     0  1
#> 2           1     1  1
#> 3           1     0  0
#> 4           1     1  0
#> attr(,"assign")
#> [1] 0 1 2
#> 
#> $common.dispersion
#> [1] 0.0226594
#> 
#> $trended.dispersion
#> [1] 1.471754e-02 1.502779e-02 9.765625e-05 9.765625e-05 9.765625e-05
#> [6] 9.765625e-05
#> 
#> $tagwise.dispersion
#> [1] 1.550797e-02 1.560039e-02 9.765625e-05 9.765625e-05 9.765625e-05
#> [6] 9.765625e-05
#> 
#> $AveLogCPM
#> [1]  2.583582499  2.620713568  0.004159126  0.388971306 -0.007506024
#> [6]  0.175425183
#> 
#> $trend.method
#> [1] "locfit"
#> 
#> $prior.df
#> [1] 115.1677 116.6016 122.7404 109.0151 122.7404 122.7404
#> 
#> $prior.n
#> [1] 115.1677 116.6016 122.7404 109.0151 122.7404 122.7404
#> 
#> $span
#> [1] 0.2767543

The output from this step is a DGEList containing all of the information from the steps.

Examining the data - multidimensional scaling plot

It’s important to visualise the differences between the samples.

You would expect the Dam samples to cluster together, and for the Fusion samples to cluster together. You would expect the majority of the variation to be within the 1st dimension (the x axis), and less variation in the 2nd dimension (y axis)

group <- dge$samples$group %>% as.character()
limma::plotMDS(dge, col = as.numeric(factor(group)))

Identifying differentially methylated regions

After exploring the data visually, it’s time to identify the enriched regions. testDmRegions compiles the edgeR functions for differential testing with one key modification - it outputs the results with the adjusted p values as well as the raw p values.

testDmRegions conducts the following key steps:

    1. fits a QLF model - quasi likelihood
    1. tests the model
    1. conducts p value adjustment and summarises model results by setting regions as either (1,0) (log fold change and p value thresholds can be adjusted)

These results are incorporated with the region data, providing a result for every region. The regions excluded from edgeR analysis are given logFC = 0, and adjust.p = 1 Setting plot=TRUE will plot an [edgeR::plotSmear()] alongside the results

dm_results <- testDmRegions(dge, p.value = 0.05, lfc = 1, regions = regions, plot = TRUE)
#> Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a
#> graphical parameter

dm_results %>%
    dplyr::group_by(meth_status) %>%
    dplyr::summarise(n = dplyr::n())
#> # A tibble: 3 × 2
#>   meth_status       n
#>   <chr>         <int>
#> 1 No_sig        30318
#> 2 Not_included 344362
#> 3 Upreg          8974

knitr::kable(head(dm_results), digits = 32)
Position seqnames start end width strand number dm logFC PValue adjust.p meth_status
chr2L-82 chr2L 82 230 149 * 1 NA 0.0000000 1.00000000 1.00000000 Not_included
chr2L-231 chr2L 231 371 141 * 2 0 0.7664018 0.06416901 0.10412260 No_sig
chr2L-372 chr2L 372 539 168 * 3 0 0.7535973 0.05679149 0.09386122 No_sig
chr2L-540 chr2L 540 688 149 * 4 NA 0.0000000 1.00000000 1.00000000 Not_included
chr2L-689 chr2L 689 829 141 * 5 NA 0.0000000 1.00000000 1.00000000 Not_included
chr2L-830 chr2L 830 997 168 * 6 NA 0.0000000 1.00000000 1.00000000 Not_included

The edgeR results can be plotted alongside the counts as well.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    log2_scale = FALSE
) +
    geom_dm(dm_results.df = dm_results)

Only regions that are fully contained within the provided boundaries will be plotted.

  • Add GATC sites
gatc_sites <- regions_and_sites$sites

knitr::kable(head(gatc_sites))
seqnames start end width strand Position
2L 80 83 4 * chr2L-80
2L 229 232 4 * chr2L-229
2L 370 373 4 * chr2L-370
2L 538 541 4 * chr2L-538
2L 687 690 4 * chr2L-687
2L 828 831 4 * chr2L-828
plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000,
    log2_scale = FALSE
) +
    geom_dm(dm_results) +
    geom_gatc(gatc_sites)

Identifying peaks (bridges)

As you could see from the plot of the differential methylation results, there are 10s of 1000s of enriched regions. To reduce the scale of this data to something that can be more biologically meaningul, enriched regions can be compiled into peaks.

Aggregating the regions

Damsel identifies peaks by aggregating regions of enrichment. As DamID sequencing generally sequences the 75 bp from the GATC site, regions smaller than 150 bp are mostly non-significant in statistical testing. Because of this, gaps between peaks of less than or equal to 150 bp are combined into one longer peak.

The FDR and logFC for each peak is calculated via the theory of [csaw::getBestTest()] where the ‘best’ (smallest) p-value in the regions that make up the peak is selected as representative of the peak. The logFC is therefore the corresponding logFC from the selected region.

peaks <- identifyPeaks(dm_results)
nrow(peaks)
#> [1] 700

knitr::kable(head(peaks), digits = 32)
peak_id seqnames start end width strand rank_p logFC_match FDR multiple_peaks region_pos n_regions_dm n_regions_not_dm
PS_1182 chr2L 10488545 10489575 1031 * 1 7.406036 5.605345e-07 1 chr2L-10488545 3 0
PM_900 chr2L 8028432 8032164 3733 * 2 5.594633 5.605345e-07 2 chr2L-8029996 8 1
PS_1442 chr2L 13204670 13207893 3224 * 3 5.085296 5.605345e-07 1 chr2L-13206327 8 0
PM_947 chr2L 8402774 8404891 2118 * 4 4.995993 5.605345e-07 2 chr2L-8403504 11 1
PS_1505 chr2L 13867794 13881891 14098 * 5 4.926633 5.605345e-07 1 chr2L-13871337 34 0
PM_65 chr2L 426798 434692 7895 * 6 4.884160 5.605345e-07 3 chr2L-430375 21 2

Plotting

A peak plot layer can be added to our graph

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000
) +
    geom_dm(dm_results) +
    geom_peak(peaks, peak.label = TRUE) +
    geom_gatc(gatc_sites)

The default version will not plot the peak.label

The distribution of counts per sample that have contributed to peaks can be compared.

plotCountsInPeaks(counts.df, dm_results, peaks, position = "stack")

Identifying genes associated with peaks

The peak information itself - while interesting, has no biological meaning. As the peaks represent a region that the Fusion protein interacted with on the DNA, likely as a transcription factor, we wish to identify the gene that is being affected. To do so, we need to associate the peaks with a potential “target” gene.

Note: any gene identified here is only a potential target that must be validated in laboratory procedures. There is no method available that is able to accurately predict the location and target genes of enhancers, so a key and potentially incorrect assumption in this part of the analysis is that all peaks represent binding to a local enhancer or promoter - that it is close or overlapping to the target gene.

It must also be noted that the Drosophila melanogaster genome and transcription factor interactions are different to that of mammals and using the same assumptions means results must be taken cautiously. While mammalian genes are generally spread out with little overlap, there is a large amount of overlap between Drosophila genes, requiring some intuitive interpretation of which gene the peak is potentially targeting.

In the Damsel methodology, peaks are considered to associate with genes if they are within 5kb upstream or downstream. If multiple genes are within these criteria, they are all listed, with the closest gene given the primary position.

Extract genes

The function collateGenes() uses two different mechanisms to create a list of genes. It allows for the use of a TxDb object/annotation package, or can access biomaRt.

A TxDb object

The simplest approach is to use a TxDb annotation package. A TxDb package is available for most species and has information about the genes, exons, cds, promoters, etc - which can all be accessed using the GenomicFeatures package. This presentation has a lot of slides on how to use TxDb objects to access different data types: https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#10

However, TxDb libraries contain only the Ensembl gene Ids and not the gene symbol or name. Instead we need to access an org.Db package to transfer them over. org.Db packages contain information about model organisms genome annotation, and can be used to extract various information about the gene name etc. More information can be found here https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#46

BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
BiocManager::install("org.Dm.eg.db")
library("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene
library("org.Dm.eg.db")
#> 
genes <- collateGenes(genes = txdb, regions = regions, org.Db = org.Dm.eg.db)
#>   1 gene was dropped because it has exons located on both strands of the
#>   same reference sequence or on more than one reference sequence, so
#>   cannot be represented by a single genomic range.
#>   Use 'single.strand.genes.only=FALSE' to get all the genes in a
#>   GRangesList object, or use suppressMessages() to suppress this message.
#> 'select()' returned 1:many mapping between keys and columns
#> TSS taken as start of gene, taking strand into account
knitr::kable(head(genes))
seqnames start end width strand ensembl_gene_id gene_name TSS n_regions
chr2L 7529 9484 1956 + FBgn0031208 CR11023 7529 3
chr2L 9839 21376 11538 - FBgn0002121 l(2)gl 21376 33
chr2L 21823 25155 3333 - FBgn0031209 Ir21a 25155 8
chr2L 21952 24237 2286 + FBgn0263584 asRNA:CR43609 21952 6
chr2L 25402 65404 40003 - FBgn0051973 Cda5 65404 93
chr2L 54817 55767 951 + FBgn0267987 lncRNA:CR46254 54817 1

Using the TxDb package, Damsel assumes that the TSS (transcription start site) is the same as the start site of the gene, taking the strand into account.

Accessing the biomaRt resource

Alternatively, the name of a species listed in biomaRt can be provided, and the version of the genome. The advantage of biomaRt is that a greater amount of information is able to be uncovered, including the canonical transcript. A guide to understanding more about how biomaRt functions is here: https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html

BiocManager::install("biomaRt")
library(biomaRt)
collateGenes(genes = "dmelanogaster_gene_ensembl", regions = regions, version = 109)
#> GRanges object with 23844 ranges and 5 metadata columns:
#>           seqnames          ranges strand | ensembl_gene_id   gene_name
#>              <Rle>       <IRanges>  <Rle> |     <character> <character>
#>       [1]       2L       7529-9484      + |     FBgn0031208     CR11023
#>       [2]       2L       9726-9859      + |     FBti0060580            
#>       [3]       2L      9839-21376      - |     FBgn0002121      l(2)gl
#>       [4]       2L       9888-9949      - |     FBti0059810            
#>       [5]       2L     21823-25155      - |     FBgn0031209       Ir21a
#>       ...      ...             ...    ... .             ...         ...
#>   [23840]        Y 3370887-3371602      + |     FBgn0267426     CR45779
#>   [23841]        Y 3376224-3376984      + |     FBgn0267427     CR45780
#>   [23842]        Y 3539437-3540722      - |     FBgn0046698      Pp1-Y2
#>   [23843]        Y 3561745-3618339      + |     FBgn0046323         ORY
#>   [23844]        Y 3649996-3666928      + |     FBgn0267592         CCY
#>           ensembl_transcript_id       TSS n_regions
#>                     <character> <integer> <numeric>
#>       [1]           FBtr0475186      7529         3
#>       [2]        FBti0060580-RA      9726         1
#>       [3]           FBtr0306592     21376        33
#>       [4]        FBti0059810-RA      9949         0
#>       [5]           FBtr0113008     25155         8
#>       ...                   ...       ...       ...
#>   [23840]           FBtr0346754   3370887         0
#>   [23841]           FBtr0346755   3376224         0
#>   [23842]           FBtr0346673   3540722         6
#>   [23843]           FBtr0346720   3561745       114
#>   [23844]           FBtr0347413   3649996        28
#>   -------
#>   seqinfo: 7 sequences from an unspecified genome; no seqlengths
  • accesses biomaRt using the seqnames of the appropriate GATC region file as a guide
  • accesses biomaRt a second time to obtain only the Ensembl canonical sequence information for each gene
  • identifies the number of GATC regions that overlap with each gene

Annotating genes to peaks

As stated above, Damsel associates genes with peaks if they are within 5 kb upstream or downstream. This maximum distance is an adjustable parameter within the annotatePeaksGenes() function. If set to NULL it will output all possible combinations as defined by plyranges::pair_nearest. The nature of this function means that the closest gene will be found for every peak, even if that distance is in the millions. If the user sets max_distance=NULL, we recommend undergoing some filtering to remove those associations.

To respect that some species have genes with more overlap than others, annotatePeaksGenes outputs a list of data frames. The first, closest, outputs information for every peak and it’s closest gene. The second data frame, top_five, outputs a string of the top five genes (if available) and their distances from each peak. The final data frame, all, provides the raw results and all possible gene and peak associations, as well as all available statistical results.

annotation <- annotatePeaksGenes(peaks, genes, regions = regions, max_distance = 5000)

knitr::kable(head(annotation$closest), digits = 32)
seqnames start end width total_regions n_regions_dm peak_id rank_p gene_position ensembl_gene_id gene_name midpoint_is position
chr2L 5154 9484 4331 8 3 PS_2 607 chr2L:+:7529-9484 FBgn0031208 CR11023 Upstream Peak_upstream
chr2L 9839 21376 11538 33 4 PM_4 627 chr2L:-:9839-21376 FBgn0002121 l(2)gl Upstream Peak_within_gene
chr2L 9839 21376 11538 33 5 PS_6 12 chr2L:-:9839-21376 FBgn0002121 l(2)gl Upstream Peak_within_gene
chr2L 9839 21643 11805 34 5 PS_7 18 chr2L:-:9839-21376 FBgn0002121 l(2)gl Upstream Peak_overlap_downstream
chr2L 21823 25155 3333 8 3 PS_8 552 chr2L:-:21823-25155 FBgn0031209 Ir21a Upstream Peak_within_gene
chr2L 25402 65404 40003 93 3 PM_9 572 chr2L:-:25402-65404 FBgn0051973 Cda5 Upstream Peak_within_gene
knitr::kable(head(annotation$top_five), digits = 32)
seqnames start end peak_id rank_p n_genes list_ensembl list_gene position distance_TSS min_distance
chr2L 5154 9484 PS_2 607 1 FBgn0031208 CR11023 Peak_upstream 1512 649
chr2L 9839 21376 PM_4 627 1 FBgn0002121 l(2)gl Peak_within_gene 8246.5 0
chr2L 9839 21376 PS_6 12 1 FBgn0002121 l(2)gl Peak_within_gene 3489 0
chr2L 9839 21643 PS_7 18 1 FBgn0002121 l(2)gl Peak_overlap_downstream 660 267
chr2L 21823 25155 PS_8 552 2 FBgn0031209, FBgn0263584 Ir21a, asRNA:CR43609 Peak_within_gene, Peak_within_gene 2771, -432 0, 0
chr2L 25402 65404 PM_9 572 1 FBgn0051973 Cda5 Peak_within_gene 30177 0
knitr::kable(str(annotation$all), digits = 32)
#> tibble [1,553 × 33] (S3: tbl_df/tbl/data.frame)
#>  $ seqnames        : Factor w/ 1870 levels "chr2L","chr2R",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ start           : int [1:1553] 5154 9839 9839 9839 21823 21952 25402 65609 65609 65609 ...
#>  $ end             : int [1:1553] 9484 21376 21376 21643 25155 24237 65404 68330 68330 71390 ...
#>  $ width           : num [1:1553] 4331 11538 11538 11805 3333 ...
#>  $ gene_seqnames   : Factor w/ 1870 levels "chr2L","chr2R",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ gene_start      : int [1:1553] 7529 9839 9839 9839 21823 21952 25402 65999 66318 66482 ...
#>  $ gene_end        : int [1:1553] 9484 21376 21376 21376 25155 24237 65404 66242 66524 71390 ...
#>  $ gene_width      : int [1:1553] 1956 11538 11538 11538 3333 2286 40003 244 207 4909 ...
#>  $ gene_strand     : Factor w/ 3 levels "+","-","*": 1 2 2 2 2 1 2 1 1 1 ...
#>  $ peak_seqnames   : Factor w/ 2 levels "chr2L","chr2R": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ peak_start      : int [1:1553] 5154 12694 16957 19789 22097 22097 33758 65609 65609 65609 ...
#>  $ peak_end        : int [1:1553] 6880 13565 18817 21643 22671 22671 36696 68330 68330 68330 ...
#>  $ peak_width      : int [1:1553] 1727 872 1861 1855 575 575 2939 2722 2722 2722 ...
#>  $ peak_strand     : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
#>  $ ensembl_gene_id : chr [1:1553] "FBgn0031208" "FBgn0002121" "FBgn0002121" "FBgn0002121" ...
#>  $ gene_name       : chr [1:1553] "CR11023" "l(2)gl" "l(2)gl" "l(2)gl" ...
#>  $ TSS             : int [1:1553] 7529 21376 21376 21376 25155 21952 65404 65999 66318 66482 ...
#>  $ n_regions       : num [1:1553] 3 33 33 33 8 6 93 0 0 15 ...
#>  $ peak_id         : chr [1:1553] "PS_2" "PM_4" "PS_6" "PS_7" ...
#>  $ rank_p          : int [1:1553] 607 627 12 18 552 552 572 161 161 161 ...
#>  $ logFC_match     : num [1:1553] 1.39 1.65 4.38 4.47 2.5 ...
#>  $ FDR             : num [1:1553] 3.88e-04 5.75e-04 5.61e-07 6.19e-07 1.97e-04 ...
#>  $ multiple_peaks  : num [1:1553] 1 2 1 1 1 1 2 1 1 1 ...
#>  $ region_pos      : chr [1:1553] "chr2L-5303" "chr2L-12715" "chr2L-17586" "chr2L-20391" ...
#>  $ n_regions_dm    : num [1:1553] 3 4 5 5 3 3 3 7 7 7 ...
#>  $ n_regions_not_dm: num [1:1553] 0 1 0 0 0 0 1 0 0 0 ...
#>  $ peak_midpoint   : num [1:1553] 6017 13130 17887 20716 22384 ...
#>  $ distance_TSS    : num [1:1553] 1512 8246 3489 660 2771 ...
#>  $ midpoint_is     : chr [1:1553] "Upstream" "Upstream" "Upstream" "Upstream" ...
#>  $ n_genes         : int [1:1553] 1 1 1 1 2 2 1 4 4 4 ...
#>  $ position        : chr [1:1553] "Peak_upstream" "Peak_within_gene" "Peak_within_gene" "Peak_overlap_downstream" ...
#>  $ min_distance    : num [1:1553] 649 0 0 267 0 0 0 0 0 873 ...
#>  $ total_regions   : int [1:1553] 8 33 33 34 8 6 93 7 7 18 ...

Interpreting results and plotting

Now that we have the genes from collateGenes(), this can be added as a layer to the previous plots. This plot requires the gene positions as a guide for a Txdb or EnsDb object, building off the autoplot generic built by ggbio.

plotCounts(counts.df,
    seqnames = "chr2L",
    start_region = 1,
    end_region = 40000
) +
    geom_dm(dm_results) +
    geom_peak(peaks) +
    geom_gatc(gatc_sites) +
    geom_genes_tx(genes, txdb)
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Registered S3 method overwritten by 'GGally':
#>   method from   
#>   +.gg   ggplot2
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.

  • If the scale of the gene plot is disproportional to the height of the overall plot - if it is too large or too squished, it can be adjusted using the plot.height argument.
plotWrap(
    id = peaks[1, ]$peak_id,
    counts.df = counts.df,
    dm_results.df = dm_results,
    peaks.df = peaks,
    gatc_sites.df = gatc_sites,
    genes.df = genes, txdb = txdb
)
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> [[1]]

plotWrap(
    id = genes[1, ]$ensembl_gene_id,
    counts.df = counts.df,
    dm_results.df = dm_results,
    peaks.df = peaks,
    gatc_sites.df = gatc_sites,
    genes.df = genes, txdb = txdb
)
#> No data available for this region
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> [[1]]

Gene ontology

One of the last steps in a typical DamID analysis is gene ontology analysis. However, a key mistake made in this analysis using any common data type - including RNA-seq, is the lack of bias correction. We correct for this by utilising the [goseq] package. Without bias correction, the ontology results would be biased towards longer genes - the longer the gene, the more likely it would be to have a peak associated with it, and therefore be called as significant. We can see this in the plot below.

GO analysis with goseq

testGeneOntology identifies the over-represented GO terms from the peak data, correcting for the number of GATC regions matching to each gene.

3 outputs Plot of goodness of fit of model signif_results: data.frame of significant GO category results, ordered by p-value. prob_weights: data.frame of probability weights for each gene

ontology <- testGeneOntology(annotation$all, genes, regions = regions, extend_by = 2000)
#> Bias will be n_regions that are contained within the gene length
#> 
#> Warning in pcls(G): initial point very close to some inequality constraints

#> Fetching GO annotations...
#> For 4411 genes, we could not find any categories. These genes will be excluded.
#> To force their use, please run with use_genes_without_cat=TRUE (see documentation).
#> This was the default behavior for version 1.15.1 and earlier.
#> Calculating the p-values...
#> 'select()' returned 1:1 mapping between keys and columns

The goodness of fit plot above shows us that there is a length based bias to the data. The x axis shows the number of GATC regions in each gene. The y axis shows the proportion of genes that have that amount of GATC regions and have been identified as significant. And it shows that as the number of GATC regions in the gene increase, as does the proportion of genes that are significant.

knitr::kable(head(ontology$signif_results), digits = 32)
category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology FDR
GO:0043231 1.490095e-09 1.0000000 370 5196 intracellular membrane-bounded organelle CC 1.755183e-05
GO:0043227 3.872693e-09 1.0000000 380 5395 membrane-bounded organelle CC 2.280823e-05
GO:0043229 1.327729e-08 1.0000000 410 5985 intracellular organelle CC 4.143625e-05
GO:0005622 1.407123e-08 1.0000000 489 7433 intracellular anatomical structure CC 4.143625e-05
GO:0043226 2.831401e-08 1.0000000 415 6099 organelle CC 6.670215e-05
GO:0008406 1.256967e-06 0.9999998 15 50 gonad development BP 2.115116e-03
knitr::kable(head(ontology$prob_weights), digits = 32)
DEgenes bias.data pwf
FBgn0031208 1 9 0.04042808
FBgn0002121 1 46 0.06715388
FBgn0031209 1 15 0.04524548
FBgn0263584 1 13 0.04365067
FBgn0051973 1 100 0.08392959
FBgn0267987 0 10 0.04123729

As expected, significant gene ontology terms surround developmental processes, which is expected as the fusion gene in the example data (Scalloped) is well known to be involved in development.

plotGeneOntology can be used to plot the top 10 results.

plotGeneOntology(ontology$signif_results)

As shown above, the plot has the category on the y-axis, the FDR on the x-axis, the size of the dot being the number of genes in the GO category, and the colour of the dot being the ontology (Biological Process, Cellular Component, and Molecular Function).

Appendix

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] biomaRt_2.63.0                            
#>  [2] org.Dm.eg.db_3.20.0                       
#>  [3] TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.12.0
#>  [4] GenomicFeatures_1.57.1                    
#>  [5] AnnotationDbi_1.69.0                      
#>  [6] Biobase_2.67.0                            
#>  [7] BSgenome.Dmelanogaster.UCSC.dm6_1.4.1     
#>  [8] BSgenome_1.73.1                           
#>  [9] rtracklayer_1.65.0                        
#> [10] BiocIO_1.17.0                             
#> [11] Biostrings_2.75.0                         
#> [12] XVector_0.45.0                            
#> [13] GenomicRanges_1.57.2                      
#> [14] GenomeInfoDb_1.41.2                       
#> [15] IRanges_2.39.2                            
#> [16] S4Vectors_0.43.2                          
#> [17] BiocGenerics_0.53.0                       
#> [18] Damsel_1.3.0                              
#> [19] BiocStyle_2.35.0                          
#> 
#> loaded via a namespace (and not attached):
#>   [1] splines_4.4.1               bitops_1.0-9               
#>   [3] filelock_1.0.3              tibble_3.2.1               
#>   [5] BiasedUrn_2.0.12            graph_1.83.0               
#>   [7] XML_3.99-0.17               rpart_4.1.23               
#>   [9] lifecycle_1.0.4             httr2_1.0.5                
#>  [11] edgeR_4.3.21                doParallel_1.0.17          
#>  [13] lattice_0.22-6              ensembldb_2.29.1           
#>  [15] OrganismDbi_1.47.0          backports_1.5.0            
#>  [17] magrittr_2.0.3              limma_3.61.12              
#>  [19] Hmisc_5.2-0                 sass_0.4.9                 
#>  [21] rmarkdown_2.28              jquerylib_0.1.4            
#>  [23] yaml_2.3.10                 ggbio_1.53.0               
#>  [25] DBI_1.2.3                   buildtools_1.0.0           
#>  [27] RColorBrewer_1.1-3          abind_1.4-8                
#>  [29] zlibbioc_1.51.2             purrr_1.0.2                
#>  [31] AnnotationFilter_1.31.0     biovizBase_1.55.0          
#>  [33] RCurl_1.98-1.16             nnet_7.3-19                
#>  [35] VariantAnnotation_1.51.2    rappdirs_0.3.3             
#>  [37] circlize_0.4.16             GenomeInfoDbData_1.2.13    
#>  [39] maketools_1.3.1             codetools_0.2-20           
#>  [41] DelayedArray_0.31.14        xml2_1.3.6                 
#>  [43] tidyselect_1.2.1            shape_1.4.6.1              
#>  [45] UCSC.utils_1.1.0            farver_2.1.2               
#>  [47] goseq_1.57.2                matrixStats_1.4.1          
#>  [49] BiocFileCache_2.15.0        base64enc_0.1-3            
#>  [51] GenomicAlignments_1.41.0    jsonlite_1.8.9             
#>  [53] GetoptLong_1.0.5            Formula_1.2-5              
#>  [55] iterators_1.0.14            foreach_1.5.2              
#>  [57] tools_4.4.1                 progress_1.2.3             
#>  [59] Rcpp_1.0.13                 glue_1.8.0                 
#>  [61] gridExtra_2.3               SparseArray_1.5.45         
#>  [63] xfun_0.48                   geneLenDataBase_1.41.2     
#>  [65] mgcv_1.9-1                  MatrixGenerics_1.17.1      
#>  [67] dplyr_1.1.4                 withr_3.0.2                
#>  [69] BiocManager_1.30.25         fastmap_1.2.0              
#>  [71] GGally_2.2.1                fansi_1.0.6                
#>  [73] digest_0.6.37               R6_2.5.1                   
#>  [75] colorspace_2.1-1            GO.db_3.20.0               
#>  [77] dichromat_2.0-0.1           RSQLite_2.3.7              
#>  [79] utf8_1.2.4                  tidyr_1.3.1                
#>  [81] generics_0.1.3              data.table_1.16.2          
#>  [83] prettyunits_1.2.0           httr_1.4.7                 
#>  [85] htmlwidgets_1.6.4           S4Arrays_1.5.11            
#>  [87] ggstats_0.7.0               pkgconfig_2.0.3            
#>  [89] gtable_0.3.6                blob_1.2.4                 
#>  [91] ComplexHeatmap_2.21.1       sys_3.4.3                  
#>  [93] htmltools_0.5.8.1           RBGL_1.81.0                
#>  [95] plyranges_1.25.0            ProtGenerics_1.37.1        
#>  [97] clue_0.3-65                 scales_1.3.0               
#>  [99] png_0.1-8                   knitr_1.48                 
#> [101] rstudioapi_0.17.1           reshape2_1.4.4             
#> [103] rjson_0.2.23                checkmate_2.3.2            
#> [105] nlme_3.1-166                curl_5.2.3                 
#> [107] cachem_1.1.0                GlobalOptions_0.1.2        
#> [109] stringr_1.5.1               parallel_4.4.1             
#> [111] foreign_0.8-87              restfulr_0.0.15            
#> [113] pillar_1.9.0                grid_4.4.1                 
#> [115] vctrs_0.6.5                 dbplyr_2.5.0               
#> [117] cluster_2.1.6               htmlTable_2.4.3            
#> [119] evaluate_1.0.1              cli_3.6.3                  
#> [121] locfit_1.5-9.10             compiler_4.4.1             
#> [123] Rsamtools_2.21.2            rlang_1.1.4                
#> [125] crayon_1.5.3                labeling_0.4.3             
#> [127] plyr_1.8.9                  stringi_1.8.4              
#> [129] BiocParallel_1.39.0         txdbmaker_1.1.2            
#> [131] munsell_0.5.1               lazyeval_0.2.2             
#> [133] Matrix_1.7-1                hms_1.1.3                  
#> [135] patchwork_1.3.0             bit64_4.5.2                
#> [137] ggplot2_3.5.1               KEGGREST_1.45.1            
#> [139] statmod_1.5.0               SummarizedExperiment_1.35.5
#> [141] highr_0.11                  memoise_2.0.1              
#> [143] bslib_0.8.0                 bit_4.5.0