The gpuMagic package is designed for ultilizing the computing power of modern GPU. The package has the ability to compile R code and implement the code on a GPU kernel. Therefore, it is easy for the user to design their own algorithm in pure R language and run the code in GPU. The package uses openCL as its parallel language so parallelization using the CPU is also supported.
To get started, it is always a good idea to check your GPU avalability.
getDeviceList()
#> No device has been found, please make sure the computer has a graphic card or the driver has been properly installed.
#> Hint:
#> For CPU, you can install the intel's / ATI's graphic driver for the intel's / AMD's CPU respectively.
#> For GPU, you need to download the graphic driver from your vendor's website.
#> No device has been found!
#> NULL
The devices listed above show you the current avalable devices, if you cannot see the device, you should install the openCL driver obtained from the corresponding vender. The default device is the device 1, if you have multiple devices, you can query and change your device by
getDeviceInfo(1)
#> No device has been found, please make sure the computer has a graphic card or the driver has been properly installed.
#> Hint:
#> For CPU, you can install the intel's / ATI's graphic driver for the intel's / AMD's CPU respectively.
#> For GPU, you need to download the graphic driver from your vendor's website.
#> No device has been found!
#> NULL
setDevice(1)
#> No device has been found!
#> NULL
The package also has a GPU resources manager. You can check your memory usage at any time by
After the proper device has been selected, you are ready to write your GPU code.
###Warning If you are using a Nvidia graphics card, it is a good practice to change your Timeout Detection and Recovery(TDR) setting as the driver will reset the graphics card if the function takes too long to run the code. The default setting is 2 seconds, which is too short for most operations. If you are using another device such as an Intel graphic card, you are safe to continue.
The setting can be found in NVIDIA Nsight. Change it to a larger number such as 20 seconds will make most programs work.
The GPU code should be written as an R function and should be compatible with the sapply function. Here is an example of matrix multiplication
#C=A%*%B
#Each time the function will compute a column of C
matMul<-function(ind,A,B){
C=A%*%B[,ind]
return(C)
}
You can test your function by using the sapply function
n=10
m=50
k=100
A=matrix(runif(n*m),n,m)
B=matrix(runif(m*k),m,k)
C_sapply=sapply(1:k,matMul,A,B)
#The internal matrix operation function in R
C_internel=A%*%B
#Compute error
max(abs(C_sapply-C_internel))
#> [1] 1.065814e-14
As you see here, the result from sapply is consistent with the result from the R internal matrix multiplication function. To run the above function in GPU, it can be simply achieved by changing sapply to gpuSapply
C_gpu=gpuSapply(1:k,matMul,A,B)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#Compute error
max(abs(C_gpu-C_internel))
#> [1] 1.065814e-14
That is all you need to do to run the GPU code! We can try a larger sample size to see the speed differences between GPU and R code.
n=1000
m=1000
k=1000
A=matrix(runif(n*m),n,m)
B=matrix(runif(n*m),m,k)
system.time(
gpuSapply(1:k,matMul,A,B)
)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#> user system elapsed
#> 1.011 1.049 0.516
system.time(
A%*%B
)
#> user system elapsed
#> 0.093 0.002 0.025
In my test environment(GF 940m), the above code takes about 0.1 secs and CPU(i5-6300U) takes 0.6 secs to finish, a 6 times speed up in the matrix operation!
Here we discuss some common questions that you will have when using the package. Some are due to the limitations of openCL language and some are due to the lazyness of the package author.
##Important notes + All objects in openCL are either a scalar or a
matrix, vector is not supported + NA
and NULL
values are not supported, I did not test them yet but it should be OK to
pass them to the device, it just may not work as you expect. This
feature will be added in the near future. + Do not allocate(create)
variables in for-loops or if-statements, it may cause an unexpected
error. Please allocate them before the loops or if-statements. + Be
careful when you do the self assignments, the code like
A[1:2]=A[2:1]
does not work, since the opencl does not have
a temporary space to store the intermediate results and therefore the
assignments will be excuted in sequence, so the above code is equivalent
to A[1]=A[2];A[2]=A[1]
. The value in A[2]
will
be incorrect. + For saving the memory space, the function arguments are
shared by all threads. Therefore changing the value of the function
arguments is depreciated unless the change made by one thread will not
be read by the other threads. +If you find any bugs in the package that
prevent you from running the openCL code, a temporary solution would be
using the scalar opration, because most bugs are introduced by the poor
design of the matrix operation. ## Dynamic allocation? Nope! In R
language, it is fairly easy to create a matrix without knowing its size
in advance, for example,
will create an n-by-n matrix A. The size of A is undetermined until the first line of the code is evaluated. The ability to create a matrix without knowing its size before running the code is called dynamic allocation. Unfortunately, due to the language limitation, openCL does not support this feature, which means the size of every matrix in the openCL code should be known before its excution. Therefore, it is impossible to create a matrix dynamically. To solve this problem, the package provides two alternative ways to make a dynamic-like matrix.
The gpuSapply
function has an argument
.macroParms
to specify the macro variable. Therefore, the
value of the macro will be plugged into the function and is known for
the compiler. for example
dynamic_example1<-function(ind,A,n){
B=matrix(0,n)
C=A+B
return(C)
}
n=10
A=matrix(runif(n),n)
#The code below wouldn't work,
#the value of n is unknown in the compilation stage
#res_device=gpuSapply(1,dynamic_example1,A,n)
#works, the value of n is knwown as a macro in the compilation stage
res_device=gpuSapply(1,dynamic_example1,A,n,.macroParms = c("n"))
#> No device has been found! Auto dispatch the gpuSapply function to sapply
Please note that when you specify a variable as a macro, you cannot change the value of it inside the function since it is fixed
If you do need a dynamic feature in your code, you can use the reference function to create it. The reference function works like a macro. For example
#Perform the column sum of the first n elements in each column of A
dynamic_example2<-function(ind,A,n){
#Dynamically subsetting the first n elements in the ind column of the matrix A
#Almost equivalent to A_dyn=A[1:n,ind]
#The size of A_dyn is unknown in the compilation stage since the value of n is unknown
A_dyn=subRef(A,1:n,ind)
C=sum(A_dyn)
return(C)
}
n=10
M=100
A=matrix(runif(M),M,M)
res_device=gpuSapply(1:M,dynamic_example2,A,n)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
res_cpu=sapply(1:M,dynamic_example2,A,n)
#check error
range(res_device-res_cpu)
#> [1] 0 0
With the reference function, you can allocate a large matrix and then
dynamically subset it to create the matrix with the size you need.
However, you need to be careful when you do the operations on the
reference object. In the example above, any changes to the reference
object A_dyn
will cause the change in its target object
A
. Please refer to the function list to see the available
reference functions.
##Dynamic return Since dynamic allocation is not possible, you may
wonder if returning an object with dynamic size is possible. The answer
is yes but it is depreciated. If you write multiple return
functions in your code. The package will first compute the maximum
length of your return variables, and then allocate the memory according
to the maximum length. If the length of one return variable cannot be
correctly infered, you will get a warning message, and the length of
that variable will be ignored. Therefore, in order to correctly return a
variable of undetermined length, you should create a matrix which size
is always larger than the return variable, then you return that matrix
in the end. Therefore the compiler will always have enough space for
your return variable( but with redundant space). For example:
dynamic_return<-function(ind,A,n){
#Dynamically subsetting the first n elements in the ind column of the matrix A
#equivalent to A_dyn=A[1:n,ind]
A_dyn=subRef(A,1:n,ind)
#return A_dyn, the size is unknown
return(A_dyn)
#The largest length of the variable A_dyn
tmp=matrix(0,nrow(A))
#tell the compiler the return size should be at least the same as the variable tmp
return(tmp)
}
n=10
M=20
A=matrix(runif(M*M),M,M)
#Expect a warning message: Undetermined return size has been found
suppressWarnings({
res_device=gpuSapply(1:M,dynamic_return,A,n)
})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#retrieve the result (The first n rows)
res_device=res_device[1:n,]
#check error
range(res_device-A[1:n,])
#> [1] 0 0
In the future, the package will provide a more concise way to specify the return size.
##Oprations between two matrice with different size? Not implemented.
The language used in the package is only a subset of the R language, it
fully supports scalar to scalar, matrix to matrix, scalar to matrix, and
matrix to scalar opration. However, for the matrix to matrix operations,
they must have the same size, otherwise an error will be given(eg.
matrix + vector is not allowed). An alternative solution for the matrix
to vector operations is using the sweep
function, please
refer to the R documentation to see the usage.
##Error checking For performance reasons, the package only performs
the error checking in the compilation stage. Therefore, when an error
occurs on GPU, the code will continue to run silently, and crash R at
some point without any hint. It is always a good idea to run a sample
code with thesapply
function to check for possible errors
before you launch the rocker with your precious data.
##Sequence variable For the sequence variable, you can create it
without limitation. As you see in the dynamic_example2
function, we have implicitly created a sequence variable
1:n
, whose size is unknown. You can use the function
seq
or :
to create the sequence. the only
exception is that you cannot change the value of the sequence variable,
because the variable is in a compact format and does not own a memory.
If you do want to create a sequence variable in memory, an alternative
way is to create the sequence first and assign it to a new variable,
then the compiler will allocate the memory for the new variable.
##Calling another user-defined R function inside the R function The current version does not support calling user-defined R functions inside your main function, this feature is under development and will come in the near future.
#More examples
Here we provide more examples to show the usage of the package.
Please note that not all the functions will be efficient when running on
the GPU, it is possible that the CPU code can be faster than the GPU.
Depending on the job type, you should decide which device you want to
run on by calling setDevice()
.
##Finding the indices of largest k values in each column of A
largestK<-function(ind,A,k){
#get a column of A
A_col=subRef(A,,ind)
#the largest k values and indices of A_col
largest_value=matrix(0,k)
largest_ind=matrix(0,k)
#loop over A_col to find the largest k values
for(i in 1:length(A_col)){
for(j in 1:k){
if(A_col[i]>largest_value[j]){
if(j!=k){
#Backward assignment to avoid self assignment problem
#(see section: Important note)
ind_src=(k-1):j
largest_value[ind_src+1]=largest_value[ind_src]
largest_ind[ind_src+1]=largest_ind[ind_src]
}
largest_value[j]=A_col[i]
largest_ind[j]=i
break
}
}
}
return_nocpy(largest_ind)
}
N=1000
M=1000
k=10
A=matrix(runif(N*M),N,M)
#Warmup
warm_res=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")
#> No device has been found! Auto dispatch the gpuSapply function to sapply
warm_res=sapply(1:M,largestK,A=A,k=k)
#count the time
system.time({res_device=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#> user system elapsed
#> 0.652 0.000 0.652
system.time({res_cpu=sapply(1:M,largestK,A=A,k=k)})
#> user system elapsed
#> 0.644 0.000 0.644
#Check the error
range(res_device-res_cpu)
#> [1] 0 0
##Brute-force method to find the solution of the linear regression with L1 loss function
computeLoss<-function(ind,x,y,parms){
#Find the parameters for the thread
parm=parms[ind,]
#Compute y hat, use no-copy transpose
y_hat=x%*%t_nocpy(parm)
#absolute loss value(L1 loss)
loss=sum(abs(y-y_hat))
return(loss)
}
#Sample size
n=1000
#Number of parameters
p=2
beta=c(2,3)
#Generate data
x=matrix(runif(n*p),n,p)
e=runif(n)
y=x%*%beta+e
#Brute-force search
#The seaching range and precision
search_range=seq(0,10,0.1)
parms=expand.grid(search_range,search_range)
#Warmup
warm_res=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
warm_res=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)
#count the time
system.time({res_device=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#> user system elapsed
#> 0.793 0.000 0.793
system.time({res_cpu=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)})
#> user system elapsed
#> 0.776 0.000 0.776
#print out the parameters which minimize the loss function
parms[which.min(res_device),]
#> Var1 Var2
#> 3460 2.5 3.4
parms[which.min(res_cpu),]
#> Var1 Var2
#> 3460 2.5 3.4
#Supported functions The package currently supports a limited number of functions, the usage of them are similar but not promised to be the same as the R functions. For example, the drop argument in [ function is not supported yet. Therefore, in the current stage, it is better to only call the simplest form of the function. The current available functions are:
Function | Note |
---|---|
nrow | the value can be used as a dimensional parameter to create a matrix |
ncol | the value can be used as a dimensional parameter to create a matrix |
length | |
matrix | |
[ | drop does not work |
abs | |
floor | |
ceiling | |
t | |
sum | Supports one argument only |
rowSums | Supports one argument only |
colSums | Supports one argument only |
rowMeans | Supports one argument only |
colMeans | Supports one argument only |
sort | Supports one argument only, ascending sort |
sweep | |
+ | |
- | |
* | |
/ | |
^ | |
> | |
>= | |
< | |
<= | |
== | |
%*% | |
return |
Explicit declaration function | note |
---|---|
subRef | Create a reference object to a subset of a matrix, the usage is same as [ function |
seq | Support from, to, by,length.out arguments only |
: | |
Matrix | Declare a matrix(Depreciated) |
Scalar | Declare a scalar(Depreciated) |
Besides traditional R function, the package also provides a few no-copy methods. These methods do not allocate memory for the result. Instead, they return a referent(AKA pointer) pointing to the target matrix. Therefore, any changes made in the reference object will cause the changes in the target matrix. This feature is for saving memory usage, but it is could be very useful when you need “dynamic” allocation. You can first allocate a matrix with fixed dimension and then create a reference object pointing to that memory space, the size of the reference object can be undetermined in the compilation stage.
Function | Note |
---|---|
subRef | No-copy subsetting, the usage is same as [ |
t_nocpy | No-copy transpose, it return a reference of the transpose of a matrix |
return_nocpy | No-copy return, return the variable without an additional copy, it does not support reference object |
#R version and Session info Here are the platform information used to compile this vignette.
version
#> _
#> platform x86_64-pc-linux-gnu
#> arch x86_64
#> os linux-gnu
#> system x86_64, linux-gnu
#> status
#> major 4
#> minor 4.2
#> year 2024
#> month 10
#> day 31
#> svn rev 87279
#> language R
#> version.string R version 4.4.2 (2024-10-31)
#> nickname Pile of Leaves
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] gpuMagic_1.23.0 BiocStyle_2.35.0
#>
#> loaded via a namespace (and not attached):
#> [1] utf8_1.2.4 sass_0.4.9 generics_0.1.3
#> [4] class_7.3-22 stringi_1.8.4 lattice_0.22-6
#> [7] hms_1.1.3 digest_0.6.37 magrittr_2.0.3
#> [10] evaluate_1.0.1 grid_4.4.2 mvtnorm_1.3-2
#> [13] fastmap_1.2.0 cellranger_1.1.0 jsonlite_1.8.9
#> [16] Matrix_1.7-1 e1071_1.7-16 BiocManager_1.30.25
#> [19] httr_1.4.7 fansi_1.0.6 DescTools_0.99.58
#> [22] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3
#> [25] rlang_1.1.4 expm_1.0-0 withr_3.0.2
#> [28] cachem_1.1.0 yaml_2.3.10 rootSolve_1.8.2.4
#> [31] tools_4.4.2 pryr_0.1.6 lmom_3.2
#> [34] gld_2.6.6 Exact_3.3 forcats_1.0.0
#> [37] boot_1.3-31 Deriv_4.1.6 BiocGenerics_0.53.3
#> [40] buildtools_1.0.0 vctrs_0.6.5 R6_2.5.1
#> [43] proxy_0.4-27 lifecycle_1.0.4 stringr_1.5.1
#> [46] MASS_7.3-61 pkgconfig_2.0.3 bslib_0.8.0
#> [49] pillar_1.9.0 data.table_1.16.2 glue_1.8.0
#> [52] Rcpp_1.0.13-1 haven_2.5.4 xfun_0.49
#> [55] tibble_3.2.1 rstudioapi_0.17.1 sys_3.4.3
#> [58] knitr_1.49 htmltools_0.5.8.1 rmarkdown_2.29
#> [61] maketools_1.3.1 compiler_4.4.2 readxl_1.4.3