Packages like reticulate facilitate the use of Python modules in our R-based data analyses, allowing us to leverage Python’s strengths in fields such as machine learning and image analysis. However, it is notoriously difficult to ensure that a consistent version of Python is available with a consistently versioned set of modules, especially when the system installation of Python is used. As a result, we cannot easily guarantee that some Python code executed via reticulate on one computer will yield the same results as the same code run on another computer. It is also possible that two R packages depend on incompatible versions of Python modules, such that it is impossible to use both packages at the same time. These versioning issues represent a major obstacle to reliable execution of Python code across a variety of systems via R/Bioconductor packages.
basilisk
uses Conda (or more specifically, Miniforge) to
provision a Python instance that is fully managed by the Bioconductor
installation machinery. This provides developers of downstream
Bioconductor packages with more control over their Python environment,
most typically by the creation of package-specific conda
environments containing all of their required Python packages.
Additionally, basilisk
provides utilities to manage different Python environments within a
single R session, enabling multiple Bioconductor packages to use
incompatible versions of Python packages in the course of a single
analysis. These features enable reproducible analysis, simplify
debugging of code and improve interoperability between compliant
packages.
The son.of.basilisk package (in the inst/example directory of this
package) is provided as an example of how one might write a client
package that depends on basilisk.
This is a fully operational example package that can be installed and
run, so prospective developers should use it as a template for their own
packages. We will assume that readers are familiar with general R
package development practices and will limit our discussion to the
basilisk-specific
elements.
StagedInstall: no
should be set, to ensure that Python
packages are installed with the correct hard-coded paths within the R
package installation directory.
Imports: basilisk
should be set along with appropriate
directives in the NAMESPACE
for all basilisk
functions that are used.
.BBSoptions should contain UnsupportedPlatforms: win32, as builds on
32-bit Windows are not supported.
BasiliskEnvironment objects
Typically, we need to create some environments with the requisite
packages using the conda package manager. A basilisk.R file should be
present in the R/ subdirectory containing commands to produce a
BasiliskEnvironment object. These objects define the Python
environments to be constructed by basilisk on behalf of your client
package.
library(basilisk)
my_env <- BasiliskEnvironment(envname="my_env_name",
    pkgname="ClientPackage",
    packages=c("pandas==1.4.3", "scikit-learn==1.1.1")
)
second_env <- BasiliskEnvironment(envname="second_env_name",
    pkgname="ClientPackage",
    packages=c("scipy=1.9.1", "numpy==1.22.1")
)
As shown above, all listed Python packages should have valid version
numbers that can be obtained by conda. It is strongly recommended to
explicitly list the versions of any dependencies so as to future-proof
the installation process. If the package versions are not known, we
suggest using setBasiliskCheckVersions(FALSE) and listPackages() to
identify the appropriate versions.
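As a rough pre-flight check, the package strings can be inspected before they are handed to BasiliskEnvironment(). The sketch below is plain base R and purely illustrative: has_version_pin() is a hypothetical helper, not part of basilisk, and its pattern (a name, then = or ==, then a version) is an assumption that may be looser or stricter than basilisk's own validation.

```r
# Hypothetical helper (not part of basilisk): TRUE for strings of the form
# "name=version" or "name==version", FALSE for unpinned package names.
has_version_pin <- function(packages) {
    grepl("^[A-Za-z0-9_.-]+={1,2}[0-9][A-Za-z0-9_.*+!-]*$", packages)
}

pkgs <- c("pandas==1.4.3", "scipy=1.9.1", "numpy")
pinned <- has_version_pin(pkgs)

# Report any entries that still need an explicit version.
pkgs[!pinned]
```

Any entry reported by the last line would need a version before being listed in packages=.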
If a different version of Python is required, it should be explicitly
listed in packages=, e.g., with python=3.7. Otherwise, basilisk will
automatically use the conda-provided version of Python (here, 3.10.14)
in all environments. It is often a good idea to explicitly list a
version of Python in packages=, even if it is already
version-compatible with the default; this ensures that the environment
creation is robust to administrator overrides of the conda instance
(see below).
By default, basilisk will fetch Conda packages from the conda-forge
channel (see setupBasiliskEnv() for details). This can be modified
through the channels= argument in the BasiliskEnvironment()
constructor, though some care is required with respect to licensing. In
particular, the default Anaconda channel (i.e., defaults) is not free
for commercial usage, which restricts the usability of any Bioconductor
packages that use Conda packages from this channel.
It is also possible to install packages from PyPI via the pip=
argument, but this should be done with caution.
An executable configure file should be created in the top level of the
client package, containing a command that calls basilisk's
configureBasiliskEnv(). This enables creation of environments during
installation of the client package if BASILISK_USE_SYSTEM_DIR is set.
For completeness, a configure.win file with the equivalent command
should also be created.
Note that basilisk.R should be executable as a standalone file and
create all BasiliskEnvironments as named variables in the current
environment. This is because the file will be directly sourced by
configureBasiliskEnv() for system installation of the environments. As
such, the file should not assume that the rest of the client package
has been installed or that the client’s various dependencies have been
loaded.
Any R functions that use Python code should do so via basiliskRun(),
which ensures that different Bioconductor packages play nicely when
their dependencies clash. To use methods from the my_env environment
that we previously defined, our hypothetical ClientPackage package
should define functions like:
my_example_function <- function(ARG_VALUE_1, ARG_VALUE_2) {
    proc <- basiliskStart(my_env)
    on.exit(basiliskStop(proc))
    some_useful_thing <- basiliskRun(proc, fun=function(arg1, arg2) {
        # The scikit-learn package is imported under the module name "sklearn".
        mod <- reticulate::import("sklearn")
        output <- mod$some_calculation(arg1, arg2)
        # The return value MUST be a pure R object, i.e., no reticulate
        # Python objects, no pointers to shared memory.
        output
    }, arg1=ARG_VALUE_1, arg2=ARG_VALUE_2)
    some_useful_thing
}
In the above chunk, a developer-defined function fun is passed to
basiliskRun() for execution inside the proc context where the specified
conda environment is loaded. Developers should not make any assumptions
about the nature of proc, which is dependent on the state of the R
session. For example, basilisk may choose to run fun in the current R
session, or in another process with the same R installation, or even in
another process with a different R installation.
basiliskStart() will lazily install conda and the Python packages in
my_env if they are not already present. This can result in some delays
when any function using basilisk is first called; afterwards, the
installed environments will simply be re-used.
Developers should respect several constraints when defining a function
for use in basiliskRun():
basiliskRun() may execute in a different process, such that any
pointers are no longer valid when they are transferred back to the
parent process. Both the arguments to the function passed to
basiliskRun() and its return value MUST be amenable to serialization.
Functions from non-base R packages should be called via the ::
operator. This ensures that the relevant package will be loaded during
function execution in a separate process. Note that certain
“problematic” Python packages may preclude the use of all non-base
functions altogether; see comments on the “last-resort fallback” in
?basiliskStart.
More details on acceptable function definitions are provided in
?basiliskRun. Developers can check that their function behaves
correctly in a different process by setting setBasiliskShared(FALSE)
and setBasiliskFork(FALSE) prior to running basiliskRun() in their unit
tests.
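One way to vet a candidate argument or return value ahead of time is to round-trip it through R's own serialization. This base R sketch is only a heuristic: survives_roundtrip() is a hypothetical helper, not a basilisk function, and it is an assumption that this check reflects how basilisk transfers objects between processes.

```r
# Hypothetical helper (not part of basilisk): serialize an object to a raw
# vector, restore it, and check that the copy is identical to the original.
survives_roundtrip <- function(x) {
    copy <- unserialize(serialize(x, connection = NULL))
    identical(copy, x)
}

# Plain R data comes back unchanged.
survives_roundtrip(matrix(1:6, ncol = 2))
survives_roundtrip(list(a = 1, b = "two"))

# An environment is restored as a new, distinct environment, so anything
# relying on reference semantics will not behave the same way afterwards.
survives_roundtrip(new.env())
```

Objects that fail this kind of check are better converted to plain vectors, lists or matrices before being passed in or returned.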
Developers can persist variables across multiple calls to basiliskRun()
by setting persist=TRUE. This instructs basiliskRun() to pass along an
environment to fun as the store= argument, which can be used inside fun
to set or get variables if basiliskRun() is called with the same proc.
Stored variables are not subject to the restrictions on the
arguments/return value of fun, but they are strictly internal to any
instance of proc.
my_example_function <- function() {
    proc <- basiliskStart(my_env)
    on.exit(basiliskStop(proc))
    basiliskRun(proc, fun=function(store) {
        # Store a random number for later retrieval.
        store$something <- runif(1)
        invisible(NULL)
    }, persist=TRUE)
    basiliskRun(proc, fun=function(store) {
        store$something
    }, persist=TRUE)
}
This capability allows developers to modularize complex Python
workflows by splitting up steps across multiple calls to basiliskRun().
However, it is probably unwise to re-use proc across user-visible
functions, i.e., the end user should never have an opportunity to
interact with proc.
In most cases, end users should not have to read this document. Properly configured basilisk clients should handle all aspects of Python environment creation and loading without requiring user intervention. That said, some system configurations are less cooperative than others: this section contains a list of known issues and possible fixes.
Historically, the Miniforge installers did not work if the
installation directory contained spaces. Recent Conda versions have
improved support for spaces, but individual packages may still
mishandle them. If spaces are causing problems in the default
installation path (check basilisk.utils::getExternalDir()), consider
setting the BASILISK_EXTERNAL_DIR environment variable to a location
without spaces.
Windows has a limit of 260 characters for its file paths. This is
occasionally exceeded due to deeply nested directories for some
packages, causing the conda installation to be incomplete or fail
outright. Furthermore, the Windows Miniforge installer has a limit of 46
characters for the top-level Conda installation directory. If either of
these constraints is causing problems, it may be possible to circumvent
them by setting BASILISK_EXTERNAL_DIR to a shorter path.
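A candidate location can be screened for both problems before it is used. The following is a minimal base R sketch under stated assumptions: check_external_dir() is a hypothetical helper (not part of basilisk), the path "C:/bsk" is made up for illustration, and the 46-character budget is borrowed from the installer limit mentioned above. Since the Conda installation sits in a subdirectory of the external directory, it is worth leaving some headroom.

```r
# Hypothetical helper (not part of basilisk): flag candidate directories that
# contain spaces or exceed a length budget.
check_external_dir <- function(path, max_chars = 46L) {
    c(no_spaces = !grepl(" ", path, fixed = TRUE),
      short_enough = nchar(path) <= max_chars)
}

candidate <- "C:/bsk"  # made-up example path
check_external_dir(candidate)

# If the candidate looks reasonable, set the variable before basilisk is used.
if (all(check_external_dir(candidate))) {
    Sys.setenv(BASILISK_EXTERNAL_DIR = candidate)
}
```

Note that Sys.setenv() only affects the current R session; a persistent setting (for example, in an .Renviron file) is likely needed for the variable to be visible in later sessions.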
Builds for 32-bit Windows are not supported due to a lack of demand relative to the difficulty of setting it up.
Older versions of Rstudio on MacOSX have some difficulties with the generation of separate processes (see here). As a workaround in such cases, users should set:
Conda installations and environments tend to use a lot of disk space, so basilisk will automatically attempt to remove its old conda installations as well as unused environments for each client package. However, this removal may not be fast enough on systems with low disk usage quotas, resulting in incomplete or failed installations. In such cases, users can forcibly clear the external directory themselves to free up some space:
# Remove obsolete environments for specific package:
basilisk.utils::clearExternalDir(package = "pkg_name", obsolete.only = TRUE)
# Remove all environments for a specific package:
basilisk.utils::clearExternalDir(package = "pkg_name")
# Remove all basilisk-managed conda installations and environments:
basilisk.utils::clearExternalDir()
IMPORTANT! The automatic installation brokered by basilisk assumes that the user accepts the Anaconda terms of service. This mostly boils down to “don’t hammer the Anaconda repository with requests”; see the commentary here.
Administrators of an R installation can modify the behavior of basilisk by setting a few environment variables. All environment variables described here must be set at both installation time and run time to have any effect. If any value is changed, it is generally safest to reinstall basilisk and all of its clients.
Setting the BASILISK_EXTERNAL_DIR environment variable will change
where the conda instance and environments are placed by basiliskStart()
during lazy installation. This is usually unnecessary unless the
default path contains spaces or the combination of the default location
and conda’s directory structure exceeds the file path length limit on
Windows.
Setting BASILISK_USE_SYSTEM_DIR to 1 will instruct basilisk to install
the conda instance in the R system directory during R package
installation. Similarly, all (correctly configured) client packages
will install their environments in the corresponding system directory
when they themselves are being installed. This is very useful for
enterprise-level deployments as the conda instances and environments
are (i) not duplicated in each user’s home directory, and (ii) always
available to any user with access to the R installation. However, it
requires installation from source and thus is not set by default.
It is possible to direct basilisk to use an existing Conda
installation by setting the BASILISK_EXTERNAL_CONDA environment
variable to an absolute path to the conda installation directory. This
may be desirable to avoid redundant copies of the same conda
installation. Any replacement conda instance should be using a similar
version of Python as basilisk’s default (3.10.14) to satisfy client
packages that rely on the implicit pinning.
Setting BASILISK_MINIFORGE_VERSION will change the Miniforge version
installed by basilisk. This should be set to the name of one of the
tagged releases on the Miniforge installation page, e.g., "24.3.0-0".
Setting BASILISK_NO_DESTROY
to 1
will
instruct basilisk
to not destroy previous conda instances and environments upon
installation of a new version of basilisk.
This destruction is done by default to avoid accumulating many large
obsolete conda instances. However, it is not desirable if there are
multiple R instances running different versions of basilisk
from the same Bioconductor release, as installation by one R instance
would delete the installed content for the other. (Multiple R instances
running different Bioconductor releases are not affected.) This option
has no effect if BASILISK_USE_SYSTEM_DIR
is set.
Setting BASILISK_NO_FALLBACK_R to 1 will instruct basilisk to not
create a conda-based R installation as a last-resort fallback (see
comments in ?basiliskStart). This avoids the
overhead of an internal R installation if it is known that shared
library version conflicts will not occur. It is most useful for
streamlining applications where developers can (i) test that no
conflicts occur in the installed subset of basilisk
clients and/or (ii) control the versions of system libraries used by R.
This option only has an effect if BASILISK_USE_SYSTEM_DIR
is set, as otherwise the fallback R installation is only created upon
encountering version conflicts.
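To confirm which of these settings are visible in a given R session (for example, when comparing the installation-time and run-time configuration), the variables can be queried with base R. This is only a sketch: it reads the environment directly and does not ask basilisk which values it is actually using.

```r
# Variable names taken from the descriptions in this section.
basilisk_vars <- c(
    "BASILISK_EXTERNAL_DIR",
    "BASILISK_USE_SYSTEM_DIR",
    "BASILISK_EXTERNAL_CONDA",
    "BASILISK_MINIFORGE_VERSION",
    "BASILISK_NO_DESTROY",
    "BASILISK_NO_FALLBACK_R"
)

# Sys.getenv() returns a named character vector; with unset = NA, variables
# that are not defined are reported as NA.
current <- Sys.getenv(basilisk_vars, unset = NA)

# Show only the variables that are currently set.
current[!is.na(current)]
```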
While basilisk is primarily intended for package developers, end users
can also take advantage of its graceful handling of multiple Python
environments in complex workflows. For example, we can easily
instantiate a conda environment in our working directory with
createLocalBasiliskEnv():
if (.Platform$OS.type != "windows") {
    tmp <- createLocalBasiliskEnv("basilisk-vignette-test",
        packages=c("scikit-learn=1.1.1", "numpy=1.22.1"))
}
We can then supply this environment’s path to basiliskRun() to execute
Python-based calculations. To demonstrate, we’ll apply scikit-learn’s
truncated SVD to a random matrix. Note that the restrictions mentioned
above for fun are still applicable here.
if (.Platform$OS.type != "windows") {
    x <- matrix(rnorm(1000), ncol=10)
    basiliskRun(env=tmp, fun=function(mat) {
        module <- reticulate::import("sklearn.decomposition")
        runner <- module$TruncatedSVD()
        output <- runner$fit(mat)
        output$singular_values_
    }, mat = x, testload="scipy.optimize")
}
## [1] 12.74465 11.62975
basiliskRun()
can also be used with conda environments
constructed outside of basilisk. Of course, in this
case, it is the user’s responsibility to ensure that the environment is
correctly provisioned. Some caution is also required as basilisk
is not guaranteed to work with environments created by a different
version of conda, though problems seem to be rare in practice.
if (.Platform$OS.type != "windows") {
    library(reticulate)
    # In this case, we'll use reticulate directly to construct our conda
    # environment; though we'll cheat a little and use basilisk's conda
    # installation, otherwise reticulate will try to install its own miniconda.
    tmp2 <- file.path(getwd(), "basilisk-vignette-test2")
    if (!file.exists(tmp2)) {
        conda.bin <- file.path(
            basilisk.utils::getCondaDir(),
            basilisk.utils::getCondaBinary()
        )
        conda_install(tmp2,
            packages=c("scipy==1.9.1"),
            python_version="3.10",
            channels="conda-forge",
            additional_create_args="--override-channels",
            additional_install_args="--override-channels",
            conda=conda.bin
        )
    }
    basiliskRun(env=tmp2, fun=function(mat) {
        module <- reticulate::import("scipy.stats")
        norm <- module$norm
        norm$cdf(c(-1, 0, 1))
    }, mat = x, testload="scipy.optimize")
}
## [1] 0.1586553 0.5000000 0.8413447
It is worth highlighting the fact that we were able to call
basiliskRun()
successfully on two different environments
within the same R session. This enables the construction of complex
analysis workflows that span across R and multiple Python
environments.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] basilisk_1.19.0 reticulate_1.39.0 BiocStyle_2.33.1
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.3 knitr_1.48 rlang_1.1.4
## [4] xfun_0.48 png_0.1-8 jsonlite_1.8.9
## [7] dir.expiry_1.13.1 buildtools_1.0.0 htmltools_0.5.8.1
## [10] maketools_1.3.1 sys_3.4.3 sass_0.4.9
## [13] rmarkdown_2.28 grid_4.4.1 filelock_1.0.3
## [16] evaluate_1.0.1 jquerylib_0.1.4 fastmap_1.2.0
## [19] yaml_2.3.10 lifecycle_1.0.4 BiocManager_1.30.25
## [22] compiler_4.4.1 Rcpp_1.0.13 lattice_0.22-6
## [25] digest_0.6.37 R6_2.5.1 parallel_4.4.1
## [28] bslib_0.8.0 Matrix_1.7-1 withr_3.0.2
## [31] tools_4.4.1 basilisk.utils_1.17.3 cachem_1.1.0