gpuMagic quick start guide

Introduction

The gpuMagic package is designed for ultilizing the computing power of modern GPU. The package has the ability to compile R code and implement the code on a GPU kernel. Therefore, it is easy for the user to design their own algorithm in pure R language and run the code in GPU. The package uses openCL as its parallel language so parallelization using the CPU is also supported.

Program setting

To get started, it is always a good idea to check your GPU avalability.

getDeviceList()
#> No device has been found, please make sure the computer has a graphic card or the driver has been properly installed.
#> Hint:
#> For CPU, you can install the intel's / ATI's graphic driver for the intel's / AMD's CPU respectively.
#> For GPU, you need to download the graphic driver from your vendor's website.
#> No device has been found!
#> NULL

The devices listed above show you the current avalable devices, if you cannot see the device, you should install the openCL driver obtained from the corresponding vender. The default device is the device 1, if you have multiple devices, you can query and change your device by

getDeviceInfo(1)
#> No device has been found, please make sure the computer has a graphic card or the driver has been properly installed.
#> Hint:
#> For CPU, you can install the intel's / ATI's graphic driver for the intel's / AMD's CPU respectively.
#> For GPU, you need to download the graphic driver from your vendor's website.
#> No device has been found!
#> NULL
setDevice(1)
#> No device has been found!
#> NULL

The package also has a GPU resources manager. You can check your memory usage at any time by

gpuMagic.getMemUsage()
#> Device memory usage:
#>  --Used: (%) --Total:

After the proper device has been selected, you are ready to write your GPU code.

###Warning If you are using a Nvidia graphics card, it is a good practice to change your Timeout Detection and Recovery(TDR) setting as the driver will reset the graphics card if the function takes too long to run the code. The default setting is 2 seconds, which is too short for most operations. If you are using another device such as an Intel graphic card, you are safe to continue.

The setting can be found in NVIDIA Nsight. Change it to a larger number such as 20 seconds will make most programs work.

GPU code

The GPU code should be written as an R function and should be compatible with the sapply function. Here is an example of matrix multiplication

#C=A%*%B
#Each time the function will compute a column of C
matMul<-function(ind,A,B){
  C=A%*%B[,ind]
  return(C)
}

You can test your function by using the sapply function

n=10
m=50
k=100
A=matrix(runif(n*m),n,m)
B=matrix(runif(m*k),m,k)
C_sapply=sapply(1:k,matMul,A,B)
#The internal matrix operation function in R
C_internel=A%*%B
#Compute error
max(abs(C_sapply-C_internel))
#> [1] 8.881784e-15

As you see here, the result from sapply is consistent with the result from the R internal matrix multiplication function. To run the above function in GPU, it can be simply achieved by changing sapply to gpuSapply

C_gpu=gpuSapply(1:k,matMul,A,B)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#Compute error
max(abs(C_gpu-C_internel))
#> [1] 8.881784e-15

That is all you need to do to run the GPU code! We can try a larger sample size to see the speed differences between GPU and R code.

n=1000
m=1000
k=1000
A=matrix(runif(n*m),n,m)
B=matrix(runif(n*m),m,k)
system.time(
  gpuSapply(1:k,matMul,A,B)
)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#>    user  system elapsed 
#>   0.995   1.122   0.530
system.time(
  A%*%B
)
#>    user  system elapsed 
#>   0.093   0.005   0.025

In my test environment(GF 940m), the above code takes about 0.1 secs and CPU(i5-6300U) takes 0.6 secs to finish, a 6 times speed up in the matrix operation!

Known issues & Keys to success

Here we discuss some common questions that you will have when using the package. Some are due to the limitations of openCL language and some are due to the lazyness of the package author.

##Important notes + All objects in openCL are either a scalar or a matrix, vector is not supported + NA and NULL values are not supported, I did not test them yet but it should be OK to pass them to the device, it just may not work as you expect. This feature will be added in the near future. + Do not allocate(create) variables in for-loops or if-statements, it may cause an unexpected error. Please allocate them before the loops or if-statements. + Be careful when you do the self assignments, the code like A[1:2]=A[2:1] does not work, since the opencl does not have a temporary space to store the intermediate results and therefore the assignments will be excuted in sequence, so the above code is equivalent to A[1]=A[2];A[2]=A[1]. The value in A[2] will be incorrect. + For saving the memory space, the function arguments are shared by all threads. Therefore changing the value of the function arguments is depreciated unless the change made by one thread will not be read by the other threads. +If you find any bugs in the package that prevent you from running the openCL code, a temporary solution would be using the scalar opration, because most bugs are introduced by the poor design of the matrix operation. ## Dynamic allocation? Nope! In R language, it is fairly easy to create a matrix without knowing its size in advance, for example,

n=sample(1:10,1)
A=matrix(0,n,n)

will create an n-by-n matrix A. The size of A is undetermined until the first line of the code is evaluated. The ability to create a matrix without knowing its size before running the code is called dynamic allocation. Unfortunately, due to the language limitation, openCL does not support this feature, which means the size of every matrix in the openCL code should be known before its excution. Therefore, it is impossible to create a matrix dynamically. To solve this problem, the package provides two alternative ways to make a dynamic-like matrix.

Use macro

The gpuSapply function has an argument .macroParms to specify the macro variable. Therefore, the value of the macro will be plugged into the function and is known for the compiler. for example

dynamic_example1<-function(ind,A,n){
  B=matrix(0,n)
  C=A+B
  return(C)
}
n=10
A=matrix(runif(n),n)
#The code below wouldn't work, 
#the value of n is unknown in the compilation stage

#res_device=gpuSapply(1,dynamic_example1,A,n)

#works, the value of n is knwown as a macro in the compilation stage
res_device=gpuSapply(1,dynamic_example1,A,n,.macroParms = c("n"))
#> No device has been found! Auto dispatch the gpuSapply function to sapply

Please note that when you specify a variable as a macro, you cannot change the value of it inside the function since it is fixed

Use reference functions

If you do need a dynamic feature in your code, you can use the reference function to create it. The reference function works like a macro. For example

#Perform the column sum of the first n elements in each column of A
dynamic_example2<-function(ind,A,n){
  #Dynamically subsetting the first n elements in the ind column of the matrix A
  #Almost equivalent to A_dyn=A[1:n,ind]
  #The size of A_dyn is unknown in the compilation stage since the value of n is unknown
  A_dyn=subRef(A,1:n,ind)
  C=sum(A_dyn)
  return(C)
}
n=10
M=100
A=matrix(runif(M),M,M)
res_device=gpuSapply(1:M,dynamic_example2,A,n)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
res_cpu=sapply(1:M,dynamic_example2,A,n)
#check error
range(res_device-res_cpu)
#> [1] 0 0

With the reference function, you can allocate a large matrix and then dynamically subset it to create the matrix with the size you need. However, you need to be careful when you do the operations on the reference object. In the example above, any changes to the reference object A_dyn will cause the change in its target object A. Please refer to the function list to see the available reference functions.

##Dynamic return Since dynamic allocation is not possible, you may wonder if returning an object with dynamic size is possible. The answer is yes but it is depreciated. If you write multiple return functions in your code. The package will first compute the maximum length of your return variables, and then allocate the memory according to the maximum length. If the length of one return variable cannot be correctly infered, you will get a warning message, and the length of that variable will be ignored. Therefore, in order to correctly return a variable of undetermined length, you should create a matrix which size is always larger than the return variable, then you return that matrix in the end. Therefore the compiler will always have enough space for your return variable( but with redundant space). For example:

dynamic_return<-function(ind,A,n){
  #Dynamically subsetting the first n elements in the ind column of the matrix A
  #equivalent to A_dyn=A[1:n,ind]
  A_dyn=subRef(A,1:n,ind)
  #return A_dyn, the size is unknown
  return(A_dyn)
  
  #The largest length of the variable A_dyn
  tmp=matrix(0,nrow(A))
  #tell the compiler the return size should be at least the same as the variable tmp
  return(tmp)

}
n=10
M=20
A=matrix(runif(M*M),M,M)
#Expect a warning message: Undetermined return size has been found
suppressWarnings({
  res_device=gpuSapply(1:M,dynamic_return,A,n)
})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#retrieve the result (The first n rows)
res_device=res_device[1:n,]

#check error
range(res_device-A[1:n,])
#> [1] 0 0

In the future, the package will provide a more concise way to specify the return size.

##Oprations between two matrice with different size? Not implemented. The language used in the package is only a subset of the R language, it fully supports scalar to scalar, matrix to matrix, scalar to matrix, and matrix to scalar opration. However, for the matrix to matrix operations, they must have the same size, otherwise an error will be given(eg. matrix + vector is not allowed). An alternative solution for the matrix to vector operations is using the sweep function, please refer to the R documentation to see the usage.

##Error checking For performance reasons, the package only performs the error checking in the compilation stage. Therefore, when an error occurs on GPU, the code will continue to run silently, and crash R at some point without any hint. It is always a good idea to run a sample code with thesapply function to check for possible errors before you launch the rocker with your precious data.

##Sequence variable For the sequence variable, you can create it without limitation. As you see in the dynamic_example2 function, we have implicitly created a sequence variable 1:n, whose size is unknown. You can use the function seq or : to create the sequence. the only exception is that you cannot change the value of the sequence variable, because the variable is in a compact format and does not own a memory. If you do want to create a sequence variable in memory, an alternative way is to create the sequence first and assign it to a new variable, then the compiler will allocate the memory for the new variable.

##Calling another user-defined R function inside the R function The current version does not support calling user-defined R functions inside your main function, this feature is under development and will come in the near future.

#More examples

Here we provide more examples to show the usage of the package. Please note that not all the functions will be efficient when running on the GPU, it is possible that the CPU code can be faster than the GPU. Depending on the job type, you should decide which device you want to run on by calling setDevice() .

##Finding the indices of largest k values in each column of A

largestK<-function(ind,A,k){
  #get a column of A
  A_col=subRef(A,,ind)
  #the largest k values and indices of A_col
  largest_value=matrix(0,k)
  largest_ind=matrix(0,k)
  #loop over A_col to find the largest k values
  for(i in 1:length(A_col)){
    for(j in 1:k){
      if(A_col[i]>largest_value[j]){
        if(j!=k){
          #Backward assignment to avoid self assignment problem
          #(see section: Important note)
          ind_src=(k-1):j
          largest_value[ind_src+1]=largest_value[ind_src]
          largest_ind[ind_src+1]=largest_ind[ind_src]
        }
        largest_value[j]=A_col[i]
        largest_ind[j]=i
        break
      }
    }
  }
  return_nocpy(largest_ind)
}

N=1000
M=1000
k=10
A=matrix(runif(N*M),N,M)
#Warmup
warm_res=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")
#> No device has been found! Auto dispatch the gpuSapply function to sapply
warm_res=sapply(1:M,largestK,A=A,k=k)
#count the time
system.time({res_device=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#>    user  system elapsed 
#>   0.650   0.001   0.650
system.time({res_cpu=sapply(1:M,largestK,A=A,k=k)})
#>    user  system elapsed 
#>   0.641   0.000   0.641
#Check the error
range(res_device-res_cpu)
#> [1] 0 0

##Brute-force method to find the solution of the linear regression with L1 loss function

computeLoss<-function(ind,x,y,parms){
  #Find the parameters for the thread
  parm=parms[ind,]
  #Compute y hat, use no-copy transpose
  y_hat=x%*%t_nocpy(parm)
  #absolute loss value(L1 loss)
  loss=sum(abs(y-y_hat))
  return(loss)
}
#Sample size
n=1000
#Number of parameters
p=2

beta=c(2,3)
#Generate data
x=matrix(runif(n*p),n,p)
e=runif(n)
y=x%*%beta+e

#Brute-force search
#The seaching range and precision
search_range=seq(0,10,0.1)
parms=expand.grid(search_range,search_range)
#Warmup
warm_res=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)
#> No device has been found! Auto dispatch the gpuSapply function to sapply
warm_res=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)

#count the time
system.time({res_device=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)})
#> No device has been found! Auto dispatch the gpuSapply function to sapply
#>    user  system elapsed 
#>   0.779   0.000   0.779
system.time({res_cpu=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)})
#>    user  system elapsed 
#>   0.776   0.000   0.776

#print out the parameters which minimize the loss function
parms[which.min(res_device),]
#>      Var1 Var2
#> 3460  2.5  3.4
parms[which.min(res_cpu),]
#>      Var1 Var2
#> 3460  2.5  3.4

#Supported functions The package currently supports a limited number of functions, the usage of them are similar but not promised to be the same as the R functions. For example, the drop argument in [ function is not supported yet. Therefore, in the current stage, it is better to only call the simplest form of the function. The current available functions are:

Function Note
nrow the value can be used as a dimensional parameter to create a matrix
ncol the value can be used as a dimensional parameter to create a matrix
length
matrix
[ drop does not work
abs
floor
ceiling
t
sum Supports one argument only
rowSums Supports one argument only
colSums Supports one argument only
rowMeans Supports one argument only
colMeans Supports one argument only
sort Supports one argument only, ascending sort
sweep
+
-
*
/
^
>
>=
<
<=
==
%*%
return
Explicit declaration function note
subRef Create a reference object to a subset of a matrix, the usage is same as [ function
seq Support from, to, by,length.out arguments only
:
Matrix Declare a matrix(Depreciated)
Scalar Declare a scalar(Depreciated)

Besides traditional R function, the package also provides a few no-copy methods. These methods do not allocate memory for the result. Instead, they return a referent(AKA pointer) pointing to the target matrix. Therefore, any changes made in the reference object will cause the changes in the target matrix. This feature is for saving memory usage, but it is could be very useful when you need “dynamic” allocation. You can first allocate a matrix with fixed dimension and then create a reference object pointing to that memory space, the size of the reference object can be undetermined in the compilation stage.

Function Note
subRef No-copy subsetting, the usage is same as [
t_nocpy No-copy transpose, it return a reference of the transpose of a matrix
return_nocpy No-copy return, return the variable without an additional copy, it does not support reference object

#R version and Session info Here are the platform information used to compile this vignette.

version
#>                _                           
#> platform       x86_64-pc-linux-gnu         
#> arch           x86_64                      
#> os             linux-gnu                   
#> system         x86_64, linux-gnu           
#> status                                     
#> major          4                           
#> minor          4.2                         
#> year           2024                        
#> month          10                          
#> day            31                          
#> svn rev        87279                       
#> language       R                           
#> version.string R version 4.4.2 (2024-10-31)
#> nickname       Pile of Leaves
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] gpuMagic_1.23.0  BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] sass_0.4.9          generics_0.1.3      class_7.3-22       
#>  [4] stringi_1.8.4       lattice_0.22-6      hms_1.1.3          
#>  [7] digest_0.6.37       magrittr_2.0.3      evaluate_1.0.1     
#> [10] grid_4.4.2          mvtnorm_1.3-2       fastmap_1.2.0      
#> [13] cellranger_1.1.0    jsonlite_1.8.9      Matrix_1.7-1       
#> [16] e1071_1.7-16        BiocManager_1.30.25 httr_1.4.7         
#> [19] DescTools_0.99.58   codetools_0.2-20    jquerylib_0.1.4    
#> [22] cli_3.6.3           rlang_1.1.4         expm_1.0-0         
#> [25] withr_3.0.2         cachem_1.1.0        yaml_2.3.10        
#> [28] rootSolve_1.8.2.4   tools_4.4.2         pryr_0.1.6         
#> [31] lmom_3.2            gld_2.6.6           Exact_3.3          
#> [34] forcats_1.0.0       boot_1.3-31         Deriv_4.1.6        
#> [37] BiocGenerics_0.53.3 buildtools_1.0.0    vctrs_0.6.5        
#> [40] R6_2.5.1            proxy_0.4-27        lifecycle_1.0.4    
#> [43] stringr_1.5.1       MASS_7.3-61         pkgconfig_2.0.3    
#> [46] bslib_0.8.0         pillar_1.10.0       data.table_1.16.4  
#> [49] glue_1.8.0          Rcpp_1.0.13-1       haven_2.5.4        
#> [52] xfun_0.49           tibble_3.2.1        rstudioapi_0.17.1  
#> [55] sys_3.4.3           knitr_1.49          htmltools_0.5.8.1  
#> [58] rmarkdown_2.29      maketools_1.3.1     compiler_4.4.2     
#> [61] readxl_1.4.3