Bioconductor has a rich ecosystem of metadata around packages, usage, and build status. This package is a simple collection of functions to access that metadata from R in a tidy data format. The goal is to expose metadata for data mining and value-added functionality such as package searching, text mining, and analytics on packages.
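Functionality includes access to build reports, download statistics, package details from DESCRIPTION files, and package dependency graphs.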
The Bioconductor build reports are available online as HTML pages.
However, they are not very computable. The biocBuildReport
function does some heroic parsing of the HTML to produce a tidy
data.frame for further processing in R.
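The output that follows was presumably produced by a call along these lines. This is a minimal sketch, assuming BiocPkgTools is installed and network access is available; the tabulation at the end is an illustrative addition, not part of the original text.

```r
library(BiocPkgTools)

# Parse the online build report into a tidy data frame (requires network access).
build_report <- biocBuildReport()
head(build_report)

# Illustrative follow-up: count rows per build result.
table(build_report$result)
```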
## # A tibble: 6 × 12
## pkg author version git_last_commit git_last_commit_date pkgType Deprecated
## <chr> <chr> <chr> <chr> <dttm> <chr> <lgl>
## 1 ABSSeq Wentao… 1.60.0 39efc10 2024-10-29 09:51:12 bioc FALSE
## 2 ABSSeq Wentao… 1.60.0 39efc10 2024-10-29 09:51:12 bioc FALSE
## 3 ABSSeq Wentao… 1.60.0 39efc10 2024-10-29 09:51:12 bioc FALSE
## 4 ABSSeq Wentao… 1.60.0 39efc10 2024-10-29 09:51:12 bioc FALSE
## 5 ABSSeq Wentao… 1.60.0 39efc10 2024-10-29 09:51:12 bioc FALSE
## 6 ABSSeq Wentao… 1.60.0 39efc10 2024-10-29 09:51:12 bioc FALSE
## # ℹ 5 more variables: PackageStatus <chr>, node <chr>, stage <chr>,
## # result <chr>, bioc_version <chr>
Because developers may be interested in a quick view of their own
packages, there is a simple function, problemPage, to produce an HTML
report of the build status of packages matching a given author regex
supplied to the authorPattern argument. The default is to report only
“problem” build statuses (ERROR, WARNING).
In similar fashion, maintainers of packages that have many downstream
packages that depend on them may wish to check that a change they
introduced hasn’t suddenly broken a large number of these. You can use
the dependsOn argument to produce the summary report of those packages
that “depend on” the given package.
When run in an interactive environment, the problemPage function will
open a browser window for user interaction. Note that if you want to
include all your package results, not just the broken ones, simply
specify includeOK = TRUE.
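The three uses of problemPage described above might look as follows. This is a sketch, assuming BiocPkgTools is installed and the build report is reachable; the author pattern and the package name are placeholders chosen for illustration.

```r
library(BiocPkgTools)

# Problem builds for packages whose author field matches a regex.
# "V.*Carey" is only a placeholder pattern; substitute your own name.
problems <- problemPage(authorPattern = "V.*Carey")
problems

# Problem builds among packages that depend on a given package.
# "limma" is an arbitrary example.
problemPage(dependsOn = "limma")

# Include passing builds as well as the broken ones.
problemPage(authorPattern = "V.*Carey", includeOK = TRUE)
```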
Bioconductor supplies download stats for all packages. The
biocDownloadStats function grabs all available download stats for all
Experiment Data, Annotation Data, and Software packages. The results
are returned as a tidy data.frame for further analysis.
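A call of roughly this form would produce the table shown next. It is a sketch that assumes BiocPkgTools and dplyr are installed and that the download statistics are reachable; the per-package summary is an illustrative addition that uses only columns visible in the output.

```r
library(BiocPkgTools)
library(dplyr)

# Fetch download statistics for all package types (requires network access).
dl_stats <- biocDownloadStats()
head(dl_stats)

# Illustrative follow-up: total downloads per package, largest first.
dl_stats %>%
    filter(!is.na(Date)) %>%
    group_by(Package) %>%
    summarise(total_downloads = sum(Nb_of_downloads)) %>%
    arrange(desc(total_downloads)) %>%
    head()
```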
## # A tibble: 6 × 7
## pkgType Package Year Month Nb_of_distinct_IPs Nb_of_downloads Date
## <chr> <chr> <int> <chr> <int> <int> <date>
## 1 software a4 2024 Jan 75 320 2024-01-01
## 2 software a4 2024 Feb 85 245 2024-02-01
## 3 software a4 2024 Mar 156 296 2024-03-01
## 4 software a4 2024 Apr 247 577 2024-04-01
## 5 software a4 2024 May 108 510 2024-05-01
## 6 software a4 2024 Jun 79 811 2024-06-01
The download statistics reported are for all available versions of a
package. There are no separate, publicly available statistics broken
down by version. The majority of Bioconductor Software packages are
also available through other channels such as Anaconda, which also
provides download statistics for packages installed from its
repositories. Access to these counts is provided by the
anacondaDownloadStats function:
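The table below presumably comes from a call like this one (a sketch, assuming BiocPkgTools is installed and network access is available):

```r
library(BiocPkgTools)

# Fetch Anaconda download counts for Bioconductor packages.
conda_stats <- anacondaDownloadStats()
head(conda_stats)
```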
## # A tibble: 6 × 7
## Package Year Month Nb_of_distinct_IPs Nb_of_downloads repo Date
## <chr> <chr> <chr> <int> <dbl> <chr> <date>
## 1 ABAData 2018 Apr NA 8 Anaconda 2018-04-01
## 2 ABAData 2018 Aug NA 5 Anaconda 2018-08-01
## 3 ABAData 2018 Dec NA 133 Anaconda 2018-12-01
## 4 ABAData 2018 Jul NA 6 Anaconda 2018-07-01
## 5 ABAData 2018 Jun NA 18 Anaconda 2018-06-01
## 6 ABAData 2018 Mar NA 13 Anaconda 2018-03-01
Note that Anaconda does not provide counts for distinct IP addresses, but this column is included for compatibility with the Bioconductor count tables.
The R DESCRIPTION file contains a plethora of information regarding
package authors, dependencies, versions, etc. In a repository such as
Bioconductor, these details are available in bulk for all included
packages. The biocPkgList function returns a data.frame with a row for
each package. A great deal of information is available, as evidenced by
the column names of the results.
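The column listing that follows can be obtained along these lines (a sketch, assuming BiocPkgTools is installed and the Bioconductor repository metadata is reachable):

```r
library(BiocPkgTools)

# One row per package, one column per DESCRIPTION-derived field.
bpi <- biocPkgList()
colnames(bpi)

# A quick look at the first rows, including the list columns.
head(bpi)
```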
## [1] "Package" "Version"
## [3] "Depends" "Suggests"
## [5] "License" "MD5sum"
## [7] "NeedsCompilation" "Title"
## [9] "Description" "biocViews"
## [11] "Author" "Maintainer"
## [13] "git_url" "git_branch"
## [15] "git_last_commit" "git_last_commit_date"
## [17] "Date/Publication" "source.ver"
## [19] "win.binary.ver" "mac.binary.big-sur-x86_64.ver"
## [21] "mac.binary.big-sur-arm64.ver" "vignettes"
## [23] "vignetteTitles" "hasREADME"
## [25] "hasNEWS" "hasINSTALL"
## [27] "hasLICENSE" "Rfiles"
## [29] "dependencyCount" "Imports"
## [31] "Enhances" "dependsOnMe"
## [33] "suggestsMe" "VignetteBuilder"
## [35] "URL" "SystemRequirements"
## [37] "BugReports" "importsMe"
## [39] "Archs" "LinkingTo"
## [41] "Video" "linksToMe"
## [43] "License_restricts_use" "OS_type"
## [45] "PackageStatus" "organism"
## [47] "License_is_FOSS"
Some of the variables are parsed to produce list columns.
## # A tibble: 6 × 47
## Package Version Depends Suggests License MD5sum NeedsCompilation Title
## <chr> <chr> <list> <list> <chr> <chr> <chr> <chr>
## 1 a4 1.54.0 <chr [5]> <chr [7]> GPL-3 40f370… no Auto…
## 2 a4Base 1.54.0 <chr [2]> <chr [4]> GPL-3 477117… no Auto…
## 3 a4Classif 1.54.0 <chr [2]> <chr [4]> GPL-3 05f58f… no Auto…
## 4 a4Core 1.54.0 <chr [1]> <chr [2]> GPL-3 5d5ce9… no Auto…
## 5 a4Preproc 1.54.0 <chr [1]> <chr [4]> GPL-3 b3655d… no Auto…
## 6 a4Reporting 1.54.0 <chr [1]> <chr [2]> GPL-3 9c1304… no Auto…
## # ℹ 39 more variables: Description <chr>, biocViews <list>, Author <list>,
## # Maintainer <list>, git_url <chr>, git_branch <chr>, git_last_commit <chr>,
## # git_last_commit_date <chr>, `Date/Publication` <chr>, source.ver <chr>,
## # win.binary.ver <chr>, `mac.binary.big-sur-x86_64.ver` <chr>,
## # `mac.binary.big-sur-arm64.ver` <chr>, vignettes <list>,
## # vignetteTitles <list>, hasREADME <chr>, hasNEWS <chr>, hasINSTALL <chr>,
## # hasLICENSE <chr>, Rfiles <list>, dependencyCount <chr>, Imports <list>, …
As a simple example of how these columns can be used, we extract the
importsMe column to find the packages that import the GEOquery package.
library(dplyr)
bpi = biocPkgList()
# Keep the GEOquery row and flatten its importsMe list column.
bpi %>%
    filter(Package == "GEOquery") %>%
    pull(importsMe) %>%
    unlist()
## [1] "bigmelon" "ChIPXpress"
## [3] "DExMA" "EGAD"
## [5] "GEOexplorer" "minfi"
## [7] "Moonlight2R" "MoonlightR"
## [9] "phantasus" "recount"
## [11] "BeadArrayUseCases" "BioPlex"
## [13] "GSE13015" "healthyControlsPresenceChecker"
## [15] "easyDifferentialGeneCoexpression" "geneExpressionFromGEO"
## [17] "RCPA" "seeker"
For the end user of Bioconductor, an analysis often starts with finding
a package or set of packages that perform required tasks or are
tailored to a specific operation or data type. The biocExplore()
function implements an interactive bubble visualization with filtering
based on biocViews terms. Bubbles are sized based on download
statistics. Tooltip and detail-on-click capabilities are included. To
start a local session:
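A local session can be started with the call named above. This sketch assumes BiocPkgTools is installed and that it runs in an interactive R session with network access.

```r
library(BiocPkgTools)

# Opens the interactive bubble visualization (viewer pane or browser).
biocExplore()
```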
The Bioconductor ecosystem is built around the concept of
interoperability and dependencies. These interdependencies are available
as part of the biocPkgList() output. The BiocPkgTools package provides
some convenience functions to convert package dependencies to R graphs.
A modular approach leads to the following workflow.

1. Create a data.frame of dependencies using
   buildPkgDependencyDataFrame.
2. Create an igraph object from the dependency data frame using
   buildPkgDependencyIgraph.
3. Use native igraph functionality to perform arbitrary network
   operations. Convenience functions, inducedSubgraphByPkgs and
   subgraphByDegree, are available.

A dependency graph for all of Bioconductor is a starting place.
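The workflow above might be sketched as follows for the full dependency graph. The code assumes BiocPkgTools and igraph are installed and that network access is available; the variable names are illustrative.

```r
library(BiocPkgTools)
library(igraph)

# Step 1: data frame of package dependencies.
dep_df <- buildPkgDependencyDataFrame()

# Step 2: igraph object built from that data frame.
g <- buildPkgDependencyIgraph(dep_df)
g

# Step 3: native igraph accessors for vertices and edges.
head(V(g))
head(E(g))
```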
## IGRAPH ecce39d DN-- 3629 29106 --
## + attr: name (v/c), edgetype (e/c)
## + edges from ecce39d (vertex names):
## [1] a4 ->a4Base a4 ->a4Preproc a4 ->a4Classif
## [4] a4 ->a4Core a4 ->a4Reporting a4Base ->a4Preproc
## [7] a4Base ->a4Core a4Classif->a4Core a4Classif->a4Preproc
## [10] ABSSeq ->methods acde ->boot aCGH ->cluster
## [13] aCGH ->survival aCGH ->multtest ACME ->Biobase
## [16] ACME ->methods ACME ->BiocGenerics ADaCGH2 ->parallel
## [19] ADaCGH2 ->ff ADaCGH2 ->GLAD ADAM ->stats
## [22] ADAM ->utils ADAM ->methods ADAMgui ->stats
## + ... omitted several edges
## + 6/3629 vertices, named, from ecce39d:
## [1] a4 a4Base a4Classif ABSSeq acde aCGH
## + 6/29106 edges from ecce39d (vertex names):
## [1] a4 ->a4Base a4 ->a4Preproc a4 ->a4Classif
## [4] a4 ->a4Core a4 ->a4Reporting a4Base->a4Preproc
See inducedSubgraphByPkgs and subgraphByDegree to produce subgraphs
based on a subset of packages.
See the igraph documentation for more detail on graph analytics, setting vertex and edge attributes, and advanced subsetting.
The visNetwork package is a nice interactive visualization tool that implements graph plotting in a browser. It can be integrated into shiny applications. Interactive graphs can also be included in Rmarkdown documents (see vignette).
The full dependency graph is really not that informative to look at, though doing so is possible. A common use case is to visualize the graph of dependencies “centered” on a package of interest. In this case, I will focus on the GEOquery package.
The subgraphByDegree() function returns all nodes and connections
within degree of the named package; the default degree is 1.
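A subgraph centered on GEOquery could be extracted as shown here. This is a sketch that assumes BiocPkgTools and igraph are installed and that network access is available; the name g2 is illustrative.

```r
library(BiocPkgTools)
library(igraph)

# Full dependency graph, then the neighborhood of GEOquery at the default degree.
g <- buildPkgDependencyIgraph(buildPkgDependencyDataFrame())
g2 <- subgraphByDegree(g, "GEOquery")
g2
```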
The visNetwork package can plot igraph objects directly, but more
flexibility is offered by first converting the graph to visNetwork
form.
The next few code chunks highlight just a few examples of the visNetwork capabilities, starting with a basic plot.
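The conversion and the basic plot might look as follows. The sketch assumes BiocPkgTools, igraph and visNetwork are installed and that network access is available; it rebuilds the GEOquery subgraph so that it runs on its own, and it uses the name data to match the plotting code in this section.

```r
library(BiocPkgTools)
library(igraph)
library(visNetwork)

# Subgraph centered on GEOquery.
g <- buildPkgDependencyIgraph(buildPkgDependencyDataFrame())
g2 <- subgraphByDegree(g, "GEOquery")

# Convert the igraph object to visNetwork's nodes/edges data frames.
data <- toVisNetworkData(g2)

# Basic interactive plot.
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px")
```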
For fun, we can watch the graph stabilize during drawing, best viewed interactively.
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visPhysics(stabilization = FALSE)
Add arrows and colors to better capture dependencies.
data$edges$color = 'lightblue'
data$edges[data$edges$edgetype == 'Imports', 'color'] = 'red'
data$edges[data$edges$edgetype == 'Depends', 'color'] = 'green'
visNetwork(nodes = data$nodes, edges = data$edges, height = "500px") %>%
    visEdges(arrows = 'from')
Add a legend.
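One way to add an edge legend is visLegend with a small data frame describing each edge type. The sketch below uses a toy network with hypothetical package names in place of the real graph so that it runs offline; it assumes only that visNetwork is installed, and it follows the color scheme used above.

```r
library(visNetwork)

# Toy stand-in for the nodes and edges of the real graph (hypothetical packages).
nodes <- data.frame(id = c("pkgA", "pkgB", "pkgC", "pkgD"),
                    label = c("pkgA", "pkgB", "pkgC", "pkgD"))
edges <- data.frame(from = c("pkgA", "pkgA", "pkgB"),
                    to = c("pkgB", "pkgC", "pkgD"),
                    edgetype = c("Depends", "Imports", "Suggests"))

# Same coloring as above: light blue by default, red Imports, green Depends.
edges$color <- "lightblue"
edges$color[edges$edgetype == "Imports"] <- "red"
edges$color[edges$edgetype == "Depends"] <- "green"

# One legend entry per edge type.
ledges <- data.frame(color = c("green", "lightblue", "red"),
                     label = c("Depends", "Suggests", "Imports"),
                     arrows = c("from", "from", "from"))

vn <- visNetwork(nodes = nodes, edges = edges, height = "500px") %>%
    visEdges(arrows = "from") %>%
    visLegend(addEdges = ledges)
vn
```

With the real graph, the nodes and edges arguments would be data$nodes and data$edges instead of the toy data frames.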
[Work in progress]
The biocViews package is a small ontology of terms describing Bioconductor packages. This is a work-in-progress section, but here is a small example of plotting the biocViews graph.
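The biocViews vocabulary ships as a graphNEL object and can be plotted with the same tools. This is a sketch, assuming the biocViews, graph, igraph and visNetwork packages are installed; the layout settings are arbitrary choices for illustration.

```r
library(biocViews)
library(igraph)
library(visNetwork)

# The vocabulary is a directed graphNEL graph of biocViews terms.
data(biocViewsVocab)
biocViewsVocab

# Convert to igraph, then to visNetwork form.
g_views <- graph_from_graphnel(biocViewsVocab)
gv <- toVisNetworkData(g_views)

# Tree layout computed by igraph; sizes and dimensions are arbitrary.
visNetwork(gv$nodes, gv$edges, width = "100%") %>%
    visIgraphLayout(layout = "layout_as_tree") %>%
    visNodes(size = 20)
```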
## A graphNEL graph with directed edges
## Number of Nodes = 497
## Number of Edges = 496
The dependency burden of a package, namely the amount of functionality
that a given package is importing, is an important parameter to take
into account during package development. A package may break because
one or more of its dependencies have changed, or even broken, the part
of the API that the package is importing. For this reason, it may be
useful for package developers to quantify the dependency burden of a
given package. To do that, we should first gather all dependency
information using the function buildPkgDependencyDataFrame(), setting
its arguments to work with packages in Bioconductor and CRAN and with
dependencies categorised as Depends or Imports, which are the ones
installed by default for a given package.
depdf <- buildPkgDependencyDataFrame(repo=c("BioCsoft", "CRAN"),
                                     dependencies=c("Depends", "Imports"))
dim(depdf)
## [1] 147205 3
## Package dependency edgetype
## 1 a4 a4Base Depends
## 2 a4 a4Preproc Depends
## 3 a4 a4Classif Depends
## 4 a4 a4Core Depends
## 5 a4 a4Reporting Depends
## 6 a4Base a4Preproc Depends
Finally, we call the function pkgDepMetrics() to obtain different
metrics on the dependency burden of a package we want to analyze, in
the case below, the package BiocPkgTools itself:
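The table below was presumably produced by a call of this form. The sketch assumes BiocPkgTools is installed with network access to Bioconductor and CRAN; it rebuilds depdf so that it runs on its own.

```r
library(BiocPkgTools)

# Depends and Imports edges for Bioconductor software and CRAN packages.
depdf <- buildPkgDependencyDataFrame(repo = c("BioCsoft", "CRAN"),
                                     dependencies = c("Depends", "Imports"))

# Dependency burden metrics for BiocPkgTools itself.
metrics <- pkgDepMetrics("BiocPkgTools", depdf)
metrics
```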
## ImportedAndUsed Exported Usage DepOverlap DepGainIfExcluded
## stats 1 453 0.22 0.01 0
## graph 1 116 0.86 0.06 0
## utils 2 228 0.88 0.01 0
## rlang 4 438 0.91 0.02 0
## igraph 9 809 1.11 0.14 4
## RBGL 1 77 1.30 0.07 0
## htmltools 1 77 1.30 0.07 0
## xml2 1 67 1.49 0.04 0
## tools 2 122 1.64 0.01 0
## stringr 1 59 1.69 0.11 0
## tibble 1 45 2.22 0.12 0
## DT 1 42 2.38 0.38 6
## rvest 1 42 2.38 0.27 2
## magrittr 1 42 2.38 0.01 0
## dplyr 8 292 2.74 0.18 0
## rorcid 1 32 3.12 0.31 8
## httr 6 91 6.59 0.09 0
## htmlwidgets 1 14 7.14 0.30 0
## jsonlite 2 23 8.70 0.01 0
## gh 1 11 9.09 0.20 4
## BiocFileCache 4 29 13.79 0.41 8
## BiocManager 3 6 50.00 0.02 0
## biocViews NA 32 NA 0.14 6
## readr NA 115 NA 0.25 6
In this resulting table, rows correspond to dependencies and columns provide the following information:
- ImportedAndUsed: number of functionality calls imported and used in
  the package.
- Exported: number of functionality calls exported by the dependency.
- Usage: (ImportedAndUsed x 100) / Exported. This value estimates the
  percentage of the dependency's functionality that is actually used in
  the given package.
- DepOverlap: similarity between the dependency graph structure of the
  given package and that of the dependency in the corresponding row,
  estimated as the Jaccard index between the two sets of vertices of
  the corresponding graphs. Its value ranges between 0 and 1, where 0
  indicates that no dependency is shared, while 1 indicates that the
  given package and the corresponding dependency depend on an identical
  set of packages.
- DepGainIfExcluded: the ‘dependency gain’ (decrease in the total
  number of dependencies) that would be obtained if this package were
  excluded from the list of direct dependencies.

The reported information is ordered by the Usage column to facilitate
the identification of dependencies of which the analyzed package uses
only a small fraction, and which could therefore be easier to remove.
To aid in that decision, the column DepOverlap reports the overlap of
the dependency graph of each dependency with that of the analyzed
package. Here a value above, e.g., 0.5 could, albeit not necessarily,
imply that removing that dependency would substantially lighten the
dependency burden of the analyzed package.
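To make the Usage and DepOverlap definitions concrete, here is a small base R sketch. The helper functions and the two package sets are hypothetical illustrations of the formulas, not the code used inside pkgDepMetrics().

```r
# Usage as defined above: percentage of a dependency's exports actually used.
usage <- function(imported_and_used, exported) {
    round(imported_and_used * 100 / exported, 2)
}

# igraph row of the table above: 9 calls used out of 809 exported.
usage(9, 809)

# Jaccard index between two sets: size of intersection over size of union.
jaccard <- function(a, b) {
    length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical sets of packages in two dependency graphs.
deps_pkg <- c("stats", "utils", "methods", "igraph")
deps_dep <- c("stats", "utils", "graphics")
jaccard(deps_pkg, deps_dep)
```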
An NA value in the ImportedAndUsed column indicates that the function
pkgDepMetrics() could not identify which functionality calls in the
analyzed package are made to the dependency. This may happen for two
reasons. First, pkgDepMetrics() may have failed to identify the
corresponding calls, as happens with imported built-in constants such
as DNA_BASES from Biostrings. Second, the given package may import that
dependency without actually using any of its functionality. In the
latter case, the dependency could be safely removed without any further
change in the analyzed package.
We can find out which functionality calls we are actually importing as follows:
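The output below presumably comes from the pkgDepImports() function of BiocPkgTools, filtered to a single dependency. This is a sketch, assuming BiocPkgTools and dplyr are installed; DT is chosen to match the output shown.

```r
library(BiocPkgTools)
library(dplyr)

# Table of imported functionality calls, one row per (pkg, fun) pair.
imp <- pkgDepImports("BiocPkgTools")

# Calls imported from the DT package.
imp %>%
    filter(pkg == "DT")
```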
## # A tibble: 1 × 2
## pkg fun
## <chr> <chr>
## 1 DT datatable
It is important to be able to identify the maintainer of a package in a
reliable way. The DESCRIPTION file for a package can include an
Authors@R field. This field can capture metadata about maintainers and
contributors in a programmatically accessible way. Each element of the
role field of a person (see ?person) in the Authors@R field comes from
a subset of the relations vocabulary of the Library of Congress.
Metadata about maintainers can be extracted from DESCRIPTION in various
ways. As of October 2022, we focus on the use of the ORCID field, which
is an optional comment component in a person element. For example, in
the DESCRIPTION for the AnVIL package we have
Authors@R:
c(person(
"Martin", "Morgan", role = c("aut", "cre"),
email = "[email protected]",
comment = c(ORCID = "0000-0002-5874-8148")
),
This convention is used for a fair number of Bioconductor and CRAN packages.
We’ll demonstrate the use of get_cre_orcids with some packages.
inst = rownames(installed.packages())
cands = c("devtools", "evaluate", "ggplot2", "GEOquery", "gert", "utils")
totry = intersect(cands, inst)
oids = get_cre_orcids(totry)
oids
## evaluate gert utils
## NA "0000-0002-4035-0289" NA
We use the ORCID API to tabulate metadata about the holders of these IDs. We’ll avoid evaluating this because a token must be refreshed for the query to succeed.
In October 2022 the result is
> orcid_table(.Last.value)
name org
devtools Jennifer Bryan RStudio
evaluate Yihui Xie RStudio, Inc.
ggplot2 Thomas Lin Pedersen RStudio
GEOquery Sean Davis University of Colorado Anschutz Medical Campus
gert Jeroen Ooms Berkeley Institute for Data Science
utils <NA> <NA>
city region country orcid
devtools Boston MA US 0000-0002-6983-2759
evaluate Elkhorn NE US 0000-0003-0645-5666
ggplot2 Copenhagen <NA> DK 0000-0002-5147-4711
GEOquery Aurora Colorado US 0000-0002-8991-6458
gert Berkeley CA US 0000-0002-4035-0289
utils <NA> <NA> <NA> <NA>
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] biocViews_1.75.0 visNetwork_2.1.2 igraph_2.1.1
## [4] dplyr_1.1.4 BiocPkgTools_1.25.2 htmlwidgets_1.6.4
## [7] knitr_1.49 BiocStyle_2.35.0
##
## loaded via a namespace (and not attached):
## [1] xfun_0.49 bslib_0.8.0 gh_1.4.1
## [4] Biobase_2.67.0 tzdb_0.4.0 vctrs_0.6.5
## [7] tools_4.4.2 bitops_1.0-9 generics_0.1.3
## [10] stats4_4.4.2 curl_6.0.1 RUnit_0.4.33
## [13] tibble_3.2.1 fansi_1.0.6 RSQLite_2.3.8
## [16] blob_1.2.4 pkgconfig_2.0.3 dbplyr_2.5.0
## [19] graph_1.85.0 lifecycle_1.0.4 stringr_1.5.1
## [22] compiler_4.4.2 htmltools_0.5.8.1 sys_3.4.3
## [25] buildtools_1.0.0 sass_0.4.9 RCurl_1.98-1.16
## [28] yaml_2.3.10 pillar_1.9.0 jquerylib_0.1.4
## [31] whisker_0.4.1 DT_0.33 cachem_1.1.0
## [34] rvest_1.0.4 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.4 purrr_1.0.2 maketools_1.3.1
## [40] fastmap_1.2.0 cli_3.6.3 magrittr_2.0.3
## [43] RBGL_1.83.0 XML_3.99-0.17 crul_1.5.0
## [46] utf8_1.2.4 withr_3.0.2 readr_2.1.5
## [49] filelock_1.0.3 bit64_4.5.2 rmarkdown_2.29
## [52] httr_1.4.7 bit_4.5.0 hms_1.1.3
## [55] memoise_2.0.1 evaluate_1.0.1 BiocFileCache_2.15.0
## [58] rlang_1.1.4 glue_1.8.0 DBI_1.2.3
## [61] httpcode_0.3.0 BiocManager_1.30.25 xml2_1.3.6
## [64] fauxpas_0.5.2 BiocGenerics_0.53.3 rorcid_0.7.0
## [67] jsonlite_1.8.9 R6_2.5.1