---
title: "Protein data using GeomxTools"
output:
  html_document:
    theme: united
    df_print: kable
  pdf_document: default
date: 'Compiled: `r Sys.Date()`'
vignette: >
  %\VignetteIndexEntry{Protein data using GeomxTools}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.width = 10
)
```

# Overview

This tutorial demonstrates how to use GeomxTools for preprocessing protein or proteogenomics data. 

# Data Processing

Data processing is very similar to what is shown in the [Developer_Introduction_to_the_NanoStringGeoMxSet](http://www.bioconductor.org/packages/release/bioc/vignettes/GeomxTools/inst/doc/Developer_Introduction_to_the_NanoStringGeoMxSet.html) and [GeoMx Workflow](https://www.bioconductor.org/packages/release/workflows/vignettes/GeoMxWorkflows/inst/doc/GeomxTools_RNA-NGS_Analysis.html) vignettes with a couple of protein specific functions. 

```{r Load Libraries}
library(GeomxTools)
```

GeoMxSet objects can only read in one analyte at a time. With protein or proteogenomics data, the desired analyte must be added to the call to read in the object. RNA is the default analyte. 
```{r Read in Data}
datadir <- system.file("extdata","DSP_Proteogenomics_Example_Data",
                       package = "GeomxTools")

DCCFiles <- unzip(zipfile = file.path(datadir,  "/DCCs.zip"))
PKCFiles <- unzip(zipfile = file.path(datadir,  "/pkcs.zip"))
SampleAnnotationFile <- file.path(datadir, "Annotation.xlsx")


RNAData <- suppressWarnings(readNanoStringGeoMxSet(dccFiles = DCCFiles,
                                                   pkcFiles = PKCFiles,
                                                   phenoDataFile = SampleAnnotationFile,
                                                   phenoDataSheet = "Annotations",
                                                   phenoDataDccColName = "Sample_ID",
                                                   protocolDataColNames = c("Tissue", 
                                                                            "Segment_Type", 
                                                                            "ROI.Size"),
                                                   configFile = NULL,
                                                   analyte = "RNA",
                                                   phenoDataColPrefix = "",
                                                   experimentDataColNames = NULL))

proteinData <- suppressWarnings(readNanoStringGeoMxSet(dccFiles = DCCFiles,
                                                       pkcFiles = PKCFiles,
                                                       phenoDataFile = SampleAnnotationFile,
                                                       phenoDataSheet = "Annotations",
                                                       phenoDataDccColName = "Sample_ID",
                                                       protocolDataColNames = c("Tissue", 
                                                                                "Segment_Type", 
                                                                                "ROI.Size"),
                                                       configFile = NULL,
                                                       analyte = "protein",
                                                       phenoDataColPrefix = "",
                                                       experimentDataColNames = NULL))

RNAData <- aggregateCounts(RNAData)
RNAData

#Protein Data is aggregated automatically on readin
proteinData
```

By having the datasets split by analyte, each object can go through the typical QC and normalization steps specific to that analyte. 


# RNA
For RNA please refer to the [introduction](http://www.bioconductor.org/packages/release/bioc/vignettes/GeomxTools/inst/doc/Developer_Introduction_to_the_NanoStringGeoMxSet.html) or [GeoMx Workflow](https://www.bioconductor.org/packages/release/workflows/vignettes/GeoMxWorkflows/inst/doc/GeomxTools_RNA-NGS_Analysis.html) vignettes. 

# Protein 

### Segment QC
After reading in the object, we will do one QC step: flag and remove low quality ROIs 

```{r GeomxTools QC}
proteinData <- setSegmentQCFlags(proteinData, qcCutoffs = list(percentSaturation = 45,
                                                               minSegmentReads=1000, 
                                                               percentAligned=80, 
                                                               minNegativeCount=10, 
                                                               maxNTCCount=60, 
                                                               minNuclei=16000, 
                                                               minArea=20))

# low sequenced ROIs
lowSaturation <- which(as.data.frame(protocolData(proteinData)[["QCFlags"]])["LowSaturation"] == TRUE)

# remove low quality ROIs
passedQC <- proteinData[, -lowSaturation]
dim(proteinData)
dim(passedQC)
```
### Target QC

Housekeepers and negative controls (IgGs) can easily be pulled out of the dataset. 

```{r HK and IgGs}
hk.names <- hkNames(proteinData)
hk.names

igg.names <- iggNames(proteinData)
igg.names
```

For the target QC step, we identify proteins with potentially little useful signal using this figure. 
```{r target QC, fig.width= 16, fig.height=7}
fig <- qcProteinSignal(object = proteinData, neg.names = igg.names)

proteinOrder <- qcProteinSignalNames(object = proteinData, neg.names = igg.names)
genesOfInterest <- c(which(proteinOrder == "Tyrosine Hydroxylase"),
                     which(proteinOrder == "ApoA-I"),
                     which(proteinOrder == "EpCAM"))

fig()
rect(xleft = 0, xright = 4, 
     ybottom = -2, ytop = 2, density = 0, col = "#1B9E77", lwd = 2)
rect(xleft = genesOfInterest[1]-1, xright = genesOfInterest[1]+1, 
     ybottom = -2, ytop = 1.25, density = 0, col = "#D95F02", lwd = 2)
rect(xleft = genesOfInterest[2]-1, xright = genesOfInterest[2]+1, 
     ybottom = -1, ytop = 3, density = 0, col = "#66A61E", lwd = 2)
rect(xleft = genesOfInterest[3]-1, xright = genesOfInterest[3]+1, 
     ybottom = -3, ytop = 6.5, density = 0, col = "#E7298A", lwd = 2)
```

  -	Negative controls (IgGs) (cyan) are plotted on the far left of the plot.
  -	Tyrosine Hydroxylase (orange) hovers around background in all segments and might need to be excluded from analysis. 
  -	ApoA-I (green) is mostly near-background, but it has meaningfully high signal in a handful of segments. 
  -	EpCAM (pink) seems to have lower background than the negative controls. But its long range, and especially the existence of points well above background, suggests this protein has interpretable data. 

The highlighted proteins may require further investigation after differential expression analysis but can typically be kept in the study.   

```{r, fig.width= 16, fig.height=7}
proteinOrder <- qcProteinSignalNames(object = proteinData, neg.names = igg.names)

P62 <- which(proteinOrder == "P62")

fig()
rect(xleft = 3.5, xright = P62, ybottom = -6, ytop = 10, density = 2, col = "red", lty = 3)
```

However, here is example code if you choose to remove them.

In bulk:

```{r, fig.width= 16, fig.height=7, eval=FALSE}
proteinOrder <- qcProteinSignalNames(object = proteinData, neg.names = igg.names)
length(proteinOrder)

P62 <- which(proteinOrder == "P62")

fig()
rect(xleft = 3.5, xright = P62, ybottom = -6, ytop = 10, density = 2, col = "red", lty = 3)

#Right most protein where all proteins to the left will get removed
#start after the IgG targets
proteinOrder <- proteinOrder[-c((length(igg.names)+1):P62)]
length(proteinOrder)

#replot with fewer targets
fig <- qcProteinSignal(object = proteinData[proteinOrder,], neg.names = igg.names)
fig()
```

Or by specific proteins: 
```{r, fig.width= 12, fig.height=7, eval=FALSE}
proteinOrder <- qcProteinSignalNames(object = proteinData[proteinOrder,], neg.names = igg.names)
#which proteins to remove from analysis
lowTargets <- c("pan-RAS", "Neprilysin", "Olig2", "P2ry12", "p53", "NY-ESO-1", "INPP4B", "CD31", "Phospho-Alpha-synuclein (S129)", "Bcl-2")
proteinOrder <- proteinOrder[-c(which(proteinOrder %in% lowTargets))]
length(proteinOrder)

fig <- qcProteinSignal(object = proteinData[proteinOrder,], neg.names = igg.names)
fig()
```

### Normalization

For more information on protein normalization please refer to our [whitepaper](https://nanostring.com/resources/introduction-to-geomx-normalization-protein/). 

After filtering targets, we move onto normalization. There are many types of normalization and we have two built in figure types to help decide what is the best method for the dataset. 

The first is a concordance plot of a list of targets, normally the IgGs or HK, colored by ROI factors like tissue or segment type. The upper panels are the concordance plots and the lower panels are the standard deviation of the log2-ratios between the targets. This figure does not show correlations because that calculation is increased with the large range that these values can take (`r min(exprs(proteinData[igg.names,]))`-`r as.integer(max(exprs(proteinData[igg.names,])))` in this example). SD(log2(ratios)) measures essentially the same thing but is invariant to that range. However the metrics are inversed, high correlation = low SDs. 

Our motivating theory is simple: if several targets all accurately measure signal strength, they should be highly correlated with each other. More precisely, the log-ratios between them should have low SDs.

```{r IgG concordance, fig.height=8, fig.width=8}
plotConcordance(object = proteinData, targetList = igg.names, plotFactor = "Tissue")
```

Above we see good concordance amongst the IgGs, confirming they all can be used. Numbers in the top-right panels show the SD of the log2-ratios between IgGs. Importantly, we do not see a tendency for one IgG to be offset from the others, suggesting there’s no between-slide bias in calculation of background.


The second plot helps show the concordance of normalization factors. The factors are calculated on the IgG and HK targets and the area or nuclei count if provided. The lower panels are the concordance plots and the upper panels are the standard deviation of the log2-ratios between the normalization factors.

```{r norm concordance, fig.height=8, fig.width=8}
normfactors <- computeNormalizationFactors(object = proteinData,
                                           area = "AOI.Size.um2",
                                           nuclei = "Nuclei.Counts")

plotNormFactorConcordance(object = proteinData, plotFactor = "Tissue",
                          normfactors = normfactors)
```

From this plot we can conclude that:

  -	The IgGs and the housekeepers agree nicely, suggesting that if we normalize using one of them, the other will leave little artifactual signal in the data. If these factors diverged strongly, we would know that normalization with one of them would fail to account for the other, leaving an artifact in the data that must be accounted for in downstream analysis. 
  -	Area and nuclei are consistent with each other (SD log2 ratio of 0.61). 
  -	Area and nuclei diverge somewhat from the target-based normalization factors Neg geomean and HK geomean. This suggests that signal strength is not purely a result of area/cell count, or alternatively, that the neg and HK geomeans are noisy metrics. 
  -	The concordance of Negs/HKs suggests their performance is adequate, leading to the conclusion that area/nuclei are noisy measurements of signal strength in this data. 

This divergence of area and nuclei vs IgGs and HKs is common which is why Background or HK normalization is recommended. The area and nuclei plots are good QC metrics to look for outliers or additionally can help you potentially ID some preferential bias in a study design.

After choosing a normalization technique from these plots, we normalize the data. Area and nuclei normalization are not native functions in GeomxTools, if you decide on normalizing by those factors you will need to do that separately. Quantile normalization is also available if HK or background normalization are not preferred. 

```{r GeomxTools Normalization}
#HK normalization
proteinData <- normalize(proteinData, norm_method="hk", toElt = "hk_norm")

#Background normalization
proteinData <- normalize(proteinData, norm_method="neg", toElt = "neg_norm")

#Quantile normalization
proteinData <- normalize(proteinData, norm_method="quant", desiredQuantile = .75, toElt = "q_norm")

names(proteinData@assayData)
```

This dataset is now ready for downstream analysis. 

```{r}
sessionInfo()
```