title: "adverSCarial, generate and analyze the vulnerability of scRNA-seq
    classifiers to adversarial attacks"
shorttitle: "adverSCarial"
author: Ghislain FIEVET <ghislain.fievet@gmail.com>
package: adverSCarial
abstract: >
    adverSCarial is an R Package designed for generating and analyzing the vulnerability of scRNA-seq
    classifiers to adversarial attacks. The package is versatile and provides a format for integrating
    any type of classifier. It offers functions for studying and generating two types of attacks,
    min change attack and max change attack. The min change attack involves making a small modification
    to the input to alter the classification. The max change attack involves making a large modification
    to the input without changing its classification.
    The package provides a comprehensive solution for evaluating the robustness of scRNA-seq classifiers
    against adversarial attacks.
vignette: >
  %\VignetteIndexEntry{Vign03_adapt_classifiers}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}


# Prepare a classifier with `CHETAH` and `scType`

All classifiers needs to be formated in a certain way to be compatible with the adverSCarial package. We provide an example on a simple formating with `CHETAH`, and a more advanced formating with `scType`.

## `CHETAH`

Here we demonstrate how to implement a classifier working with the single-gene and max-change attack, by taking the example of `CHETAH` a Bioconductor scRNA-seq classifier. 

## Load data
  

```r
library(adverSCarial)
library(TENxPBMCData)
library(CHETAH)
library(scater)
library(scran)
```

First let's load a `train` and a `test` dataset.

```r
train_3k <- TENxPBMCData(dataset = "pbmc3k")
test_4k <- TENxPBMCData(dataset = "pbmc4k")

cell_types_3k <- system.file("extdata", "pbmc3k_cell_types.tsv", package="adverSCarial")
cell_types_3k <- read.table(cell_types_3k, sep="\t")
colData(train_3k)$celltypes <- cell_types_3k$cell_type
colnames(train_3k) = colData(train_3k)[['Barcode']]
colnames(test_4k) = colData(test_4k)[['Barcode']]
```


Then we process the `test_4k` to annotate and visualize the cell types.

We annotate cells with `CHETAH`.


```r
input <- CHETAHclassifier(input = test_4k, ref_cells = train_3k)
input <- Classify(input = input, 0.00001)
colData(test_4k)$celltypes <- input$celltype_CHETAH
```

Process data.


```r
test_4k <- logNormCounts(test_4k)
dec <- modelGeneVar(test_4k)
hvg <- getTopHVGs(dec, prop=0.1)
test_4k <- runPCA(test_4k, ncomponents=25, subset_row=hvg)
test_4k <- runUMAP(test_4k, dimred = 'PCA')
```

Visualize the results


```r
plotUMAP(test_4k, colour_by="celltypes")
```

![plot of chunk chetah dimplot](figure/chetahdimplot1.png)

## Adapt the classifier
`CHETAH` is a classifier that, when given a SingleCellExperiment object, returns a specific cell type from each cell. We need to adjust the classifier so that it can be used by *adverSCarial*.
    
Each classifier function has to be formated as follow to be used with the following functions: advSingleGene, advMaxChange, advGridMinChange, advRandWalkMinChange, maxChangeOverview, minSingleGeneOverview.
```R
    classifier = function(expr, clusters, target){
                
                # `score` should be numeric between 0 and 1
                # 1 being the highest confidance into the cell type classification.
                c(
                    prediction="cell type",
                    odd=score)
    }
```
    
The `expr` argument contrains the RNA expression values, can be a *matrix*, a *data.frame* or a *SingleCellExperiment*.
The list `clusters` consists of the cluster IDs for each cell in `expr`, and `target` is the ID of the cluster for which we want to have a classification. The function returns a vector with the classification result, and a trust indice.

This is how you can adapt `CHETAH` for `adverSCarial`.

```r
CHETAHClassifier <- function(expr, clusters, target){
    if (!exists("reference_3k")) {
        reference_3k <<- train_3k
    }
    input <- CHETAHclassifier(input = expr, ref_cells = reference_3k)
    input <- Classify(input = input, 0.01)
    final_predictions = input$celltype_CHETAH[clusters == target]
    ratio <- as.numeric(sort(table(final_predictions), decreasing = TRUE)[1]) /
        sum(as.numeric(sort(table(final_predictions), decreasing = TRUE)))
    predicted_class <- names(sort(table(final_predictions), decreasing = TRUE)[1])
    if ( ratio < 0.3){
        predicted_class <- "NA"
    }
    c(prediction=predicted_class,
      odd=ratio)
}
```

This classifier takes as input a SingleCellExperiment object, you need to specify the `argForClassif="SingleCellExperiment"`
argument in *adverSCarial* function. If your classifier takes as input a *matrix* or a *data.frame* you can let the default 
`argForClassif="data.frame"` argument.

You can now test `CHETAH` classifier with `adverSCarial` tools.

Let's run a `maxChangeAttack`.


```r
adv_max_change <- advMaxChange(test_4k, colData(test_4k)$celltypes, "CD14+ Mono", CHETAHClassifier, advMethod="perc99", maxSplitSize = 2000, argForClassif="SingleCellExperiment")
```


Let's run this attack and verify if it is successful.

First we modify the `test_4k` SingleCellExperiment object on the target cluster, on the genes previously determined.


```r
test_4k_adver <- advModifications(test_4k, adv_max_change@values, colData(test_4k)$celltypes, "CD14+ Mono", argForClassif="SingleCellExperiment")
```

Then we verify that classification is still `DC`.

```r
rf_result <- CHETAHClassifier(test_4k_adver, colData(test_4k)$celltypes, "CD14+ Mono")
```

```r
rf_result
```

```
## [1] "CD14+ Mono" "1"
```

## `scType`

Here we demonstrate how to implement a classifier working with all the attacks, include the gradient-based CGD, by taking the example of `scType` a scRNA-seq classifier.

Ianevski, A., Giri, A.K., Aittokallio, T. Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat Commun, 2022;13:1246. https://doi.org/10.1038/s41467-022-28803-w 

## Load data

```r
df_pbmc_test <- read.table("/home/gui/Dropbox/INSERM/jupyterlab/0033_these/data/v2/seurat_scaled_pbmc_test.txt")
expr_df <- df_pbmc_test[, -which(names(df_pbmc_test) == "y")]
clusters_df <- df_pbmc_test$y
names(clusters_df) <- rownames(df_pbmc_test)
```

RNA expression matrix.

```r
expr_df[1:5,1:5]
```

```
##                   AL627309.1  AP006222.2 RP11.206L10.2 RP11.206L10.9
## AAACATACAACCAC-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AAACATTGATCAGC-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AAACGCACTGGTAC-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AAATGTTGCCACAA-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AACACGTGCAGAGG-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
##                    LINC00115
## AAACATACAACCAC-1 -0.08223981
## AAACATTGATCAGC-1 -0.08223981
## AAACGCACTGGTAC-1 -0.08223981
## AAATGTTGCCACAA-1 -0.08223981
## AACACGTGCAGAGG-1 -0.08223981
```

Cell clusters

```r
head(clusters_df)
```

```
## AAACATACAACCAC-1 AAACATTGATCAGC-1 AAACGCACTGGTAC-1 AAATGTTGCCACAA-1 
##   "Memory CD4 T"   "Memory CD4 T"   "Memory CD4 T"   "Memory CD4 T" 
## AACACGTGCAGAGG-1 AACCGCCTAGCGTT-1 
##   "Memory CD4 T"   "Memory CD4 T"
```

We can find how to use scType here: https://github.com/IanevskiAleksandr/sc-type
We adapt it so it returns:
- the predicted cell type
- a score of the prediction
- a matrix of the likelihood of each cell type for each cell. Cell types are rows and cells are columns.
- the list of the predicted cell type for each cell

Each classifier function has to be formated as follow to be used with the all the functions of the package, especially for advCGD.
```R
    classifier = function(expr, clusters, target){
                c(
                    prediction="cell type",
                    odd=score,
                    typePredictions=my_matrix,
                    cellTypes=my_vector)
    }
```

```r
scType_classifier = function(expr, clusters, target){
    expr = t(expr)
    library(HGNChelper)
    source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")
    source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
    db_ = "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
    tissue = "Immune system" # e.g. Immune system,Pancreas,Liver,Eye,Kidney,Brain,Lung,Adrenal,Heart,Intestine,Muscle,Placenta,Spleen,Stomach,Thymus 
    # prepare gene sets
    gs_list <- gene_sets_prepare(db_, tissue)
    
    es.max = sctype_score(scRNAseqData = expr, scaled = T, 
                          gs = gs_list$gs_positive, gs2 = gs_list$gs_negative)
    
    if (sum(clusters == target) == 0 ){
        return( c("UNDETERMINED",1))
    }
    cell_types <- apply(t(es.max[, clusters == target]), 1, function(x){
        names(x[x == max(x)])[1]
    })
    table_cell_type <- table(cell_types)
    str_class <- names(table_cell_type[order(table_cell_type, decreasing=T)][1])
    my_odd <- mean(es.max[str_class, clusters==target])/10
    resSCtype <- list(
        # Cell type prediction for the cluster
        prediction=str_class,
        # Score of the predicted cell type
        odd=my_odd,
        # Score for each cell type for each cell
        typePredictions=es.max,
        # Cell type for each cell
        cellTypes=cell_types)

    return(resSCtype)
}
```
We check if the classifier works properly by asking him to predict the "NK" cluster.


```r
classifier_results <- scType_classifier(expr_df, clusters_df, "NK")
classifier_results$prediction
```

```
## [1] "Natural killer  cells"
```

The score of the prediction.

```r
classifier_results$odd
```

```
## [1] 0.4461257
```

Likelihood of each cell type for each cell.

```r
classifier_results$typePredictions[,clusters_df == "NK"][1:6,1:5]
```

```
##                    AAACCGTGTATGCG-1 AACGCCCTCGTACA-1 AAGATTACCTCAAG-1
## Pro-B cells             -0.05520064       0.03507093       0.16081207
## Pre-B cells             -0.59654036      -0.68228753      -0.44937559
## Naive B cells           -0.93855041      -0.04346109      -0.26012173
## Memory B cells          -1.00357418      -0.13690767      -0.34668843
## Plasma B cells          -0.60566221       0.28942711       0.07276648
## Naive CD8+ T cells       0.79524385       1.51477730       1.87037543
##                    AAGCAAGAGGTGTT-1 ACAAATTGTTGCGA-1
## Pro-B cells              -0.2828646        0.9118437
## Pre-B cells              -0.9487842        0.5254085
## Naive B cells            -1.2664716        1.7915419
## Memory B cells           -1.3210825        1.6398264
## Plasma B cells           -0.9335834        2.1244301
## Naive CD8+ T cells        0.9343186        0.2921489
```

Predicted cell type for each cell.

```r
head(classifier_results$cellTypes)
```

```
##        AAACCGTGTATGCG-1        AACGCCCTCGTACA-1        AAGATTACCTCAAG-1 
## "Natural killer  cells" "Natural killer  cells" "Natural killer  cells" 
##        AAGCAAGAGGTGTT-1        ACAAATTGTTGCGA-1        ACAGGTACTGGTGT-1 
##   "CD8+ NKT-like cells" "Natural killer  cells" "Natural killer  cells"
```

Let's run a CGD attack

```r
# Get significant genes
sign_genes <- getSignGenes(expr_df, clusters_df)
head(sign_genes$results)
```

```
##            gene         pval
## HLA.DRA HLA.DRA 1.004947e-70
## PRF1       PRF1 6.443834e-62
## NKG7       NKG7 1.070165e-79
## FCER1A   FCER1A 2.765930e-18
## TYROBP   TYROBP 8.127634e-85
## IL32       IL32 1.260937e-66
```

```r
# Run the attack
result_cgd <- advCGD(expr_df, clusters_df, "NK", classifier=scType_classifier ,
                     genes=sign_genes$results$gene[1:100] ,alpha=2, epsilon=2)
```

Modified genes

```r
result_cgd$modGenes
```

```
## [1] "NKG7" "GZMB"
```

Modified RNA expression matrix

```r
result_cgd$expr[1:5,1:5]
```

```
##                   AL627309.1  AP006222.2 RP11.206L10.2 RP11.206L10.9
## AAACATACAACCAC-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AAACATTGATCAGC-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AAACGCACTGGTAC-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AAATGTTGCCACAA-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
## AACACGTGCAGAGG-1 -0.05812316 -0.03357571   -0.04166819   -0.03364562
##                    LINC00115
## AAACATACAACCAC-1 -0.08223981
## AAACATTGATCAGC-1 -0.08223981
## AAACGCACTGGTAC-1 -0.08223981
## AAATGTTGCCACAA-1 -0.08223981
## AACACGTGCAGAGG-1 -0.08223981
```

Check the new classification of the result.

```r
new_classifier_results <- scType_classifier(result_cgd$expr, clusters_df, "NK")
new_classifier_results$prediction
```

```
## [1] "CD8+ NKT-like cells"
```


```r
sessionInfo()
```

```
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Paris
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] HGNChelper_0.8.1            scran_1.29.0               
##  [3] scater_1.29.0               scuttle_1.11.0             
##  [5] CHETAH_1.17.0               ggplot2_3.4.2              
##  [7] TENxPBMCData_1.19.0         HDF5Array_1.29.3           
##  [9] rhdf5_2.45.0                DelayedArray_0.27.5        
## [11] SparseArray_1.1.10          S4Arrays_1.1.4             
## [13] Matrix_1.5-4.1              SingleCellExperiment_1.23.0
## [15] SummarizedExperiment_1.31.1 Biobase_2.61.0             
## [17] GenomicRanges_1.53.1        GenomeInfoDb_1.37.2        
## [19] IRanges_2.35.2              S4Vectors_0.39.1           
## [21] BiocGenerics_0.47.0         MatrixGenerics_1.13.0      
## [23] matrixStats_1.0.0           adverSCarial_1.3.6         
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3            jsonlite_1.8.5               
##   [3] magrittr_2.0.3                ggbeeswarm_0.7.2             
##   [5] farver_2.1.1                  corrplot_0.92                
##   [7] zlibbioc_1.47.0               vctrs_0.6.3                  
##   [9] memoise_2.0.1                 DelayedMatrixStats_1.23.0    
##  [11] RCurl_1.98-1.12               htmltools_0.5.5              
##  [13] AnnotationHub_3.9.1           curl_5.0.1                   
##  [15] BiocNeighbors_1.19.0          Rhdf5lib_1.23.0              
##  [17] htmlwidgets_1.6.2             plyr_1.8.8                   
##  [19] plotly_4.10.2                 cachem_1.0.8                 
##  [21] igraph_1.5.0                  mime_0.12                    
##  [23] lifecycle_1.0.3               pkgconfig_2.0.3              
##  [25] rsvd_1.0.5                    R6_2.5.1                     
##  [27] fastmap_1.1.1                 GenomeInfoDbData_1.2.10      
##  [29] shiny_1.7.4                   digest_0.6.32                
##  [31] colorspace_2.1-0              AnnotationDbi_1.63.1         
##  [33] dqrng_0.3.0                   irlba_2.3.5.1                
##  [35] ExperimentHub_2.8.0           RSQLite_2.3.1                
##  [37] beachmat_2.17.8               labeling_0.4.2               
##  [39] filelock_1.0.2                fansi_1.0.4                  
##  [41] httr_1.4.6                    compiler_4.3.0               
##  [43] bit64_4.0.5                   withr_2.5.0                  
##  [45] BiocParallel_1.35.2           viridis_0.6.3                
##  [47] DBI_1.1.3                     highr_0.10                   
##  [49] dendextend_1.17.1             rappdirs_0.3.3               
##  [51] bluster_1.11.1                tools_4.3.0                  
##  [53] vipor_0.4.5                   beeswarm_0.4.0               
##  [55] interactiveDisplayBase_1.39.0 zip_2.3.0                    
##  [57] httpuv_1.6.11                 glue_1.6.2                   
##  [59] rhdf5filters_1.13.3           promises_1.2.0.1             
##  [61] grid_4.3.0                    cluster_2.1.4                
##  [63] reshape2_1.4.4                generics_0.1.3               
##  [65] gtable_0.3.3                  tidyr_1.3.0                  
##  [67] data.table_1.14.8             metapod_1.9.0                
##  [69] BiocSingular_1.17.0           ScaledMatrix_1.9.1           
##  [71] utf8_1.2.3                    XVector_0.41.1               
##  [73] RcppAnnoy_0.0.20              ggrepel_0.9.3                
##  [75] BiocVersion_3.18.0            pillar_1.9.0                 
##  [77] stringr_1.5.0                 limma_3.57.6                 
##  [79] later_1.3.1                   dplyr_1.1.2                  
##  [81] BiocFileCache_2.9.0           lattice_0.21-8               
##  [83] bit_4.0.5                     tidyselect_1.2.0             
##  [85] locfit_1.5-9.8                Biostrings_2.69.1            
##  [87] knitr_1.43                    gridExtra_2.3                
##  [89] edgeR_3.43.7                  xfun_0.39                    
##  [91] statmod_1.5.0                 pheatmap_1.0.12              
##  [93] stringi_1.7.12                lazyeval_0.2.2               
##  [95] yaml_2.3.7                    evaluate_0.21                
##  [97] codetools_0.2-19              tibble_3.2.1                 
##  [99] BiocManager_1.30.21           cli_3.6.1                    
## [101] uwot_0.1.15                   xtable_1.8-4                 
## [103] munsell_0.5.0                 Rcpp_1.0.10                  
## [105] bioDist_1.73.0                dbplyr_2.3.2                 
## [107] png_0.1-8                     parallel_4.3.0               
## [109] ellipsis_0.3.2                blob_1.2.4                   
## [111] sparseMatrixStats_1.13.0      bitops_1.0-7                 
## [113] viridisLite_0.4.2             scales_1.2.1                 
## [115] openxlsx_4.2.5.2              purrr_1.0.1                  
## [117] crayon_1.5.2                  rlang_1.1.1                  
## [119] cowplot_1.1.1                 KEGGREST_1.41.0
```