Analysis of bead-summary data

beadarray is a package for the pre-processing and analysis of Illumina BeadArray. The main advantage is being able to read raw data created by Illumina’s scanning software. Data created in this manner are in the same format regardless of the assay (i.e expression, genotyping, methylation) being performed. Thus, beadarray is able to handle all these types of data. Many functions within beadarray have been written to cope with this flexibility.

The BeadArray technology involves randomly arranged arrays of beads, with beads having the same probe sequence attached colloquially known as a bead-type. BeadArrays are combined in parallel on either a rectangular chip (BeadChip) or a matrix of 8 by 12 hexagonal arrays (Sentrix Array Matrix or SAM). The BeadChip is further divided into strips on the surface known as sections, with each section giving rise to a different image when scanned by BeadScan. These images, and associated text files, comprise the raw data for a beadarray analysis. However, for BeadChips, the number of sections assigned to each biological sample may vary from 1 on HumanHT12 chips, 2 on HumanWG6 chips or sometimes ten or more for SNP chips with large numbers of SNPs being investigated.

This vignette demonstrates the analysis of bead summary data using beadarray. The recommended approach to obtain these data is to start with bead-level data and follow the steps illustrated in the vignette beadlevel.pdf distributed with beadarray. If bead-level data are not available, the output of Illumina’s BeadStudio or GenomeStudio can be read by beadarray. Example code to do this is provided at the end of this vignette. However, the same object types are produced from either of these routes and the same functionality is available.

To make the most use of the code in this vignette, you will need to install the beadarrayExampleData and illuminaHumanv3.db packages from Bioconductor. We use the BiocManager package to install these:

install.packages("BiocManager")
BiocManager::install(c("beadarrayExampleData", "illuminaHumanv3.db"))

The code used to produce these example data is given in the vignette of beadarrayExampleData, which follow similar steps to those described in the beadlevel.pdf vignette of beadarray. The following commands give a basic description of the data.

library("beadarray")

require(beadarrayExampleData)

data(exampleSummaryData)

exampleSummaryData
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
##     total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

Summarized data are stored in an object of type ExpressionSetIllumina which is an extension of the ExpressionSet class developed by the Bioconductor team as a container for data from high-throughput assays. Objects of this type use a series of slots to store the data. For consistency with the definition of other ExpressionSet objects, we refer to the expression values as the exprs matrix (this stores the probe-specific average intensities) which can be accessed using exprs and subset in the usual manner. The se.exprs matrix, which stores the probe-specific variability can be accessed using se.exprs. You may notice that the expression values have already been transformed to the log2 scale, which is an option in the summarize function in beadarray. Data exported from BeadStudio or GenomeStudio will usually be un-transformed and on the scale 0 to 216.

exprs(exampleSummaryData)[1:5,1:5]
##              G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
## ILMN_1802380       8.454468       8.616796       8.523001       8.420796
## ILMN_1893287       5.388161       5.419345       5.162849       5.133287
## ILMN_1736104       5.268626       5.457679       5.012766       4.988511
## ILMN_1792389       6.767519       7.183788       6.947624       7.168571
## ILMN_1854015       5.556947       5.721614       5.595413       5.520391
##              G:4616443093_B
## ILMN_1802380       8.527748
## ILMN_1893287       5.221987
## ILMN_1736104       5.284026
## ILMN_1792389       7.386435
## ILMN_1854015       5.558717
se.exprs(exampleSummaryData)[1:5,1:5]
##              G:4613710017_B G:4613710052_B G:4613710054_B G:4616443079_B
## ILMN_1802380      0.2833023      0.3367157      0.2750020      0.4141796
## ILMN_1893287      0.3963681      0.3882834      0.5516421      0.6761106
## ILMN_1736104      0.4704854      0.4951260      0.4031143      0.5276266
## ILMN_1792389      0.4038533      0.4728013      0.5032908      0.3447242
## ILMN_1854015      0.5663066      0.3783570      0.5511991      0.5358812
##              G:4616443093_B
## ILMN_1802380      0.3581862
## ILMN_1893287      0.4448673
## ILMN_1736104      0.4864355
## ILMN_1792389      0.3951935
## ILMN_1854015      0.6748219

feature and pheno data

The fData and pData functions are useful shortcuts to find more information about the features (rows) and samples (columns) in the summary object. These annotations are created automatically whenever a bead-level data is summarized (see beadlevel.pdf) or read from a BeadStudio file. The fData will be added to later, but initially contains information on whether each probe is a control or not. In this example the phenoData denotes the sample group for each array; either Brain or UHRR (Universal Human Reference RNA).

head(fData(exampleSummaryData))
##              ArrayAddressID   IlluminaID  Status
## ILMN_1802380          10008 ILMN_1802380 regular
## ILMN_1893287          10010 ILMN_1893287 regular
## ILMN_1736104          10017 ILMN_1736104 regular
## ILMN_1792389          10019 ILMN_1792389 regular
## ILMN_1854015          10020 ILMN_1854015 regular
## ILMN_1904757          10021 ILMN_1904757 regular
table(fData(exampleSummaryData)[,"Status"])
## 
##                     biotin                    cy3_hyb 
##                          2                          2 
## cy3_hyb,low_stringency_hyb               housekeeping 
##                          4                          7 
##                   labeling         low_stringency_hyb 
##                          2                          4 
##                   negative                    regular 
##                        759                      48796
pData(exampleSummaryData)
##                  sampleID SampleFac
## 4613710017_B 4613710017_B      UHRR
## 4613710052_B 4613710052_B      UHRR
## 4613710054_B 4613710054_B      UHRR
## 4616443079_B 4616443079_B      UHRR
## 4616443093_B 4616443093_B      UHRR
## 4616443115_B 4616443115_B      UHRR
## 4616443081_B 4616443081_B     Brain
## 4616443081_H 4616443081_H     Brain
## 4616443092_B 4616443092_B     Brain
## 4616443107_A 4616443107_A     Brain
## 4616443136_A 4616443136_A     Brain
## 4616494005_A 4616494005_A     Brain

Subsetting the data

There are various way to subset an expressionSetIllumina object, each of which returns an ExpressionSetIllumina with the same slots, but different dimensions. When bead-level data are summarized by beadarray there is an option to apply different transformation options, and save the results as different channels in the resultant object. For instance, if summarizing two-colour data one might be interested in summarizing the red and green channels, or some combination of the two, separately. Both log2 and un-logged data are stored in the exampleSummaryData object and can be accessed by using the channel function. Both the rows and columns in the resultant ExpressionSetIllumina object are kept in the same order.

channelNames(exampleSummaryData)
## [1] "G"    "G.ul"
exampleSummaryData.log2 <- channel(exampleSummaryData, "G")
exampleSummaryData.unlogged <- channel(exampleSummaryData, "G.ul")


sampleNames(exampleSummaryData.log2)
##  [1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
##  [6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
## [11] "4616443136_A" "4616494005_A"
sampleNames(exampleSummaryData.unlogged)
##  [1] "4613710017_B" "4613710052_B" "4613710054_B" "4616443079_B" "4616443093_B"
##  [6] "4616443115_B" "4616443081_B" "4616443081_H" "4616443092_B" "4616443107_A"
## [11] "4616443136_A" "4616494005_A"
exprs(exampleSummaryData.log2)[1:10,1:3]
##              4613710017_B 4613710052_B 4613710054_B
## ILMN_1802380     8.454468     8.616796     8.523001
## ILMN_1893287     5.388161     5.419345     5.162849
## ILMN_1736104     5.268626     5.457679     5.012766
## ILMN_1792389     6.767519     7.183788     6.947624
## ILMN_1854015     5.556947     5.721614     5.595413
## ILMN_1904757     5.421553     5.320500     5.522316
## ILMN_1740305     5.417821     5.623998     5.720007
## ILMN_1665168     5.321087     5.155455     4.967601
## ILMN_2375156     5.894207     6.076418     5.638877
## ILMN_1705423     5.426463     4.806624     5.357688
exprs(exampleSummaryData.unlogged)[1:10,1:3]
##              4613710017_B 4613710052_B 4613710054_B
## ILMN_1802380    356.88235    396.46875    367.81481
## ILMN_1893287     40.85000     44.29167     38.42105
## ILMN_1736104     40.53333     46.50000     33.46154
## ILMN_1792389    112.90909    153.17647    122.65000
## ILMN_1854015     50.47059     53.26087     51.57143
## ILMN_1904757     41.45833     42.10000     49.92593
## ILMN_1740305     38.45455     51.50000     46.21429
## ILMN_1665168     42.38889     37.95000     30.46154
## ILMN_2375156     61.47368     72.73913     52.46154
## ILMN_1705423     42.38889     28.14286     38.62500

As we have seen, the expression matrix of the ExpressionSetIllumina object can be subset by column or row, In fact, the same subset operations can be performed on the ExpressionSetIllumina object itself. In the following code, notice how the number of samples and features changes in the output.

exampleSummaryData.log2[,1:4]
## ExpressionSetIllumina (storageMode: list)
## assayData: 49576 features, 4 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B 4613710054_B 4616443079_B
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1846115 (49576
##     total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, 4613710054_B, 4616443079_B
exampleSummaryData.log2[1:10,]
## ExpressionSetIllumina (storageMode: list)
## assayData: 10 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1802380 ILMN_1893287 ... ILMN_1705423 (10 total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

The object can also be subset by a vector of characters which must correspond to the names of features (i.e. row names). Currently, no analogous functions is available to subset by sample.

randIDs <- sample(featureNames(exampleSummaryData), 1000)

exampleSummaryData[randIDs,]
## ExpressionSetIllumina (storageMode: list)
## assayData: 1000 features, 12 samples 
##   element names: exprs, se.exprs, nObservations 
## protocolData: none
## phenoData
##   rowNames: 4613710017_B 4613710052_B ... 4616494005_A (12 total)
##   varLabels: sampleID SampleFac
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_2123550 ILMN_1719531 ... ILMN_1797219 (1000 total)
##   fvarLabels: ArrayAddressID IlluminaID Status
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: Date, Matrix, ..., SampleGroup, numBeads
##   sampleNames: 4613710017_B, 4613710052_B, ..., 4616443136_A, 4616494005_A

Exploratory analysis using boxplots

Boxplots of intensity levels and the number of beads are useful for quality assessment purposes. beadarray includes a modified version of the boxplot function that can take any valid ExpressionSetIllumina object and plot the expression matrix by default. For these examples we plot just a subset of the original exampleSummaryData object using random row IDs.

boxplot(exampleSummaryData.log2[randIDs,])

The function can also plot other assayData items, such as the number of observations.

boxplot(exampleSummaryData.log2[randIDs,], what="nObservations")

The default boxplot plots a separate box for each array, but often it is beneficial for compare expression levels between different sample groups. If this information is stored in the phenoData slot it can be incorporated into the plot. The following compares the overall expression level between UHRR and Brain samples.

boxplot(exampleSummaryData.log2[randIDs,], SampleGroup="SampleFac")

In a similar manner, we may wish to visualize the differences between sample groups for particular probe groups. As a simple example, we look at the difference between negative controls and regular probes for each array. You should notice that the negative controls as consistently lower (as expected) with the exception of array 4616443081_B.

boxplot(exampleSummaryData.log2[randIDs,], probeFactor = "Status")

Extra feature annotation is available from annotation packages in Bioconductor, and beadarray includes functionality to extract these data from the annotation packages. The annotation of the object must be set in order that the correct annotation package can be loaded. For example, the exampleSummaryData object was generated from Humanv3 data so the illuminaHumanv3.db package must be present. The addFeatureData function annotates all features of an ExpressionSetIllumina object using particular mappings from the illuminaHumanv3.db package. To see which mappings are available you can use the illuminaHumanv3() function, or equivalent from other packages.

annotation(exampleSummaryData)
## [1] "Humanv3"
exampleSummaryData.log2 <- addFeatureData(exampleSummaryData.log2, 
toAdd = c("SYMBOL", "PROBEQUALITY", "CODINGZONE", "PROBESEQUENCE", "GENOMICLOCATION"))

head(fData(exampleSummaryData.log2))
##                 Row.names ArrayAddressID   IlluminaID  Status SYMBOL
## ILMN_1802380 ILMN_1802380          10008 ILMN_1802380 regular   RERE
## ILMN_1893287 ILMN_1893287          10010 ILMN_1893287 regular   <NA>
## ILMN_1736104 ILMN_1736104          10017 ILMN_1736104 regular   <NA>
## ILMN_1792389 ILMN_1792389          10019 ILMN_1792389 regular  ARK2C
## ILMN_1854015 ILMN_1854015          10020 ILMN_1854015 regular   <NA>
## ILMN_1904757 ILMN_1904757          10021 ILMN_1904757 regular   <NA>
##              PROBEQUALITY      CODINGZONE
## ILMN_1802380      Perfect  Transcriptomic
## ILMN_1893287          Bad Transcriptomic?
## ILMN_1736104          Bad      Intergenic
## ILMN_1792389      Perfect  Transcriptomic
## ILMN_1854015          Bad      Intergenic
## ILMN_1904757   Perfect*** Transcriptomic?
##                                                   PROBESEQUENCE
## ILMN_1802380 GCCCTGACCTTCATGGTGTCTTTGAAGCCCAACCACTCGGTTTCCTTCGG
## ILMN_1893287 GGATTTCCTACACTCTCCACTTCTGAATGCTTGGAAACACTTGCCATGCT
## ILMN_1736104 TGCCATCTTTGCTCCACTGTGAGAGGCTGCTCACACCACCCCCTACATGC
## ILMN_1792389 CTGTAGCAACGTCTGTCAGGCCCCCTTGTGTTTCATCTCCTGCGCGCGTA
## ILMN_1854015 GCAGAAAACCATGAGCTGAAATCTCTACAGGAACCAGTGCTGGGGTAGGG
## ILMN_1904757 AGCTGTACCGTGGGGAGGCTTGGTCCTCTTGCCCCATTTGTGTGATGTCT
##                         GENOMICLOCATION
## ILMN_1802380     chr1:8412758:8412807:-
## ILMN_1893287   chr9:42489407:42489456:+
## ILMN_1736104 chr3:134572184:134572223:-
## ILMN_1792389  chr18:44040244:44040293:+
## ILMN_1854015 chr3:160827837:160827885:+
## ILMN_1904757 chr3:197872267:197872316:+
illuminaHumanv3()
## ####Mappings based on RefSeqID####
## Quality control information for illuminaHumanv3:
## 
## 
## This package has the following mappings:
## 
## illuminaHumanv3ACCNUM has 31857 mapped keys (of 49576 keys)
## illuminaHumanv3ALIAS2PROBE has 63289 mapped keys (of 260933 keys)
## illuminaHumanv3CHR has 29550 mapped keys (of 49576 keys)
## illuminaHumanv3CHRLENGTHS has 93 mapped keys (of 711 keys)
## illuminaHumanv3CHRLOC has 29354 mapped keys (of 49576 keys)
## illuminaHumanv3CHRLOCEND has 29354 mapped keys (of 49576 keys)
## illuminaHumanv3ENSEMBL has 29154 mapped keys (of 49576 keys)
## illuminaHumanv3ENSEMBL2PROBE has 20997 mapped keys (of 40839 keys)
## illuminaHumanv3ENTREZID has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3ENZYME has 3526 mapped keys (of 49576 keys)
## illuminaHumanv3ENZYME2PROBE has 967 mapped keys (of 975 keys)
## illuminaHumanv3GENENAME has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3GO has 26989 mapped keys (of 49576 keys)
## illuminaHumanv3GO2ALLPROBES has 19471 mapped keys (of 22446 keys)
## illuminaHumanv3GO2PROBE has 15200 mapped keys (of 18678 keys)
## illuminaHumanv3MAP has 29402 mapped keys (of 49576 keys)
## illuminaHumanv3OMIM has 21943 mapped keys (of 49576 keys)
## illuminaHumanv3PATH has 9180 mapped keys (of 49576 keys)
## illuminaHumanv3PATH2PROBE has 229 mapped keys (of 229 keys)
## illuminaHumanv3PMID has 29329 mapped keys (of 49576 keys)
## illuminaHumanv3PMID2PROBE has 439054 mapped keys (of 800658 keys)
## illuminaHumanv3REFSEQ has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3SYMBOL has 29551 mapped keys (of 49576 keys)
## illuminaHumanv3UNIPROT has 27704 mapped keys (of 49576 keys)
## 
## 
## Additional Information about this package:
## 
## DB schema: HUMANCHIP_DB
## DB schema version: 2.1
## Organism: Homo sapiens
## Date for NCBI data: 2015-Mar17
## Date for GO data: 20150314
## Date for KEGG data: 2011-Mar15
## Date for Golden Path data: 2010-Mar22
## Date for Ensembl data: 2015-Mar13
## ####Custom Mappings based on probe sequence####
## illuminaHumanv3ARRAYADDRESS()
## illuminaHumanv3NUID()
## illuminaHumanv3PROBEQUALITY()
## illuminaHumanv3CODINGZONE()
## illuminaHumanv3PROBESEQUENCE()
## illuminaHumanv3SECONDMATCHES()
## illuminaHumanv3OTHERGENOMICMATCHES()
## illuminaHumanv3REPEATMASK()
## illuminaHumanv3OVERLAPPINGSNP()
## illuminaHumanv3ENTREZREANNOTATED()
## illuminaHumanv3GENOMICLOCATION()
## illuminaHumanv3SYMBOLREANNOTATED()
## illuminaHumanv3REPORTERGROUPNAME()
## illuminaHumanv3REPORTERGROUPID()
## illuminaHumanv3ENSEMBLREANNOTATED()

If we suspect that a particular gene may be differentially expressed between conditions, we can subset the ExpressionSetIllumina object to just include probes that target the gene, and plot the response of these probes against the sample groups. Furthermore, the different probes can be distinguished using the probeFactor parameter.

ids <- which(fData(exampleSummaryData.log2)[,"SYMBOL"] == "ALB")

boxplot(exampleSummaryData.log2[ids,], 
  SampleGroup = "SampleFac", probeFactor = "IlluminaID")

A note about ggplot2

The boxplot function in beadarray creates graphics using the ggplot2 package rather than the R base graphics system. Therefore, the standard way of manipulating graphics using par and mfrow etc will not work with the output of boxplot. However, the ggplot2 package has equivalent functionality and is a more powerful and flexible system. There are numerous tutorials on how to use the ggplot2 package, which is beyond the scope of this vignette. In the below code, we assign the results of boxplot to objects that we combine using the gridExtra package. The code also demonstrates how aspects of the plot can be altered programatically.

library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:beadarray':
## 
##     combine
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
bp1 <- boxplot(exampleSummaryData.log2[ids,], 
SampleGroup = "SampleFac", probeFactor = "IlluminaID") 
bp1 <- bp1+ labs(title = "ALB expression level comparison") + xlab("Illumina Probe") + ylab("Log2 Intensity")

bp2 <- boxplot(exampleSummaryData.log2[randIDs,], probeFactor = "Status") 
bp2 <- bp2 + labs(title = "Control Probe Comparison")

grid.arrange(bp1,bp2)

We can also extract the data that was used to construct the plot.

bp1$data
##            Var1         Var2     value SampleGroup  probeFactor
## 1  ILMN_1782939 4613710017_B 13.528212        UHRR ILMN_1782939
## 2  ILMN_1682763 4613710017_B 13.264742        UHRR ILMN_1682763
## 3  ILMN_1782939 4613710052_B 13.800577        UHRR ILMN_1782939
## 4  ILMN_1682763 4613710052_B 12.947888        UHRR ILMN_1682763
## 5  ILMN_1782939 4613710054_B 13.841128        UHRR ILMN_1782939
## 6  ILMN_1682763 4613710054_B 12.641636        UHRR ILMN_1682763
## 7  ILMN_1782939 4616443079_B 13.119897        UHRR ILMN_1782939
## 8  ILMN_1682763 4616443079_B 12.575922        UHRR ILMN_1682763
## 9  ILMN_1782939 4616443093_B 13.468822        UHRR ILMN_1782939
## 10 ILMN_1682763 4616443093_B 12.878392        UHRR ILMN_1682763
## 11 ILMN_1782939 4616443115_B 13.510831        UHRR ILMN_1782939
## 12 ILMN_1682763 4616443115_B 12.634381        UHRR ILMN_1682763
## 13 ILMN_1782939 4616443081_B  5.190355       Brain ILMN_1782939
## 14 ILMN_1682763 4616443081_B  5.249992       Brain ILMN_1682763
## 15 ILMN_1782939 4616443081_H  7.995407       Brain ILMN_1782939
## 16 ILMN_1682763 4616443081_H  6.788807       Brain ILMN_1682763
## 17 ILMN_1782939 4616443092_B  5.549147       Brain ILMN_1782939
## 18 ILMN_1682763 4616443092_B  5.388535       Brain ILMN_1682763
## 19 ILMN_1782939 4616443107_A  5.704762       Brain ILMN_1782939
## 20 ILMN_1682763 4616443107_A  5.617309       Brain ILMN_1682763
## 21 ILMN_1782939 4616443136_A  5.729863       Brain ILMN_1782939
## 22 ILMN_1682763 4616443136_A  5.658919       Brain ILMN_1682763
## 23 ILMN_1782939 4616494005_A  5.849509       Brain ILMN_1782939
## 24 ILMN_1682763 4616494005_A  5.598482       Brain ILMN_1682763

Other exploratory analysis

Replicate samples can also be compared using the function.

mas <- plotMA(exampleSummaryData.log2,do.log=FALSE)
## No sample factor specified. Comparing to reference array
mas

In each panel we see the MA plots for all arrays in the experiment compared to a ‘reference’ array composed of the average intensities of all probes. On an MA plot, for each probe we plot the average of the log2 -intensities from the two arrays on the x-axis and the difference in intensities (log -ratios) on the y-axis. We would expect most probes to be and hence most points on the plot should lie along the line y=0.

As with boxplot, the object returned is a ggplot2 object that can be modified by the end-user.

##Added lines on the y axis
mas + geom_hline(yintercept=c(-1.5,1.5),col="red",lty=2)

##Added a smoothed line to each plot
mas+ geom_smooth(col="red")
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 12 rows containing non-finite outside the scale range
## (`stat_smooth()`).

##Changing the color scale
mas + scale_fill_gradient2(low="yellow",mid="orange",high="red")

We can also specify a sample grouping, which will make all pairwise comparisons

mas <- plotMA(exampleSummaryData.log2,do.log=FALSE,SampleGroup="SampleFac")
mas[[1]]

Normalisation

To correct for differences in expression level across a chip and between chips we need to normalise the signal to make the arrays comparable. The normalisation methods available in the affy package, or variance-stabilising transformation from the lumi package may be applied using the normaliseIllumina function. Below we quantile normalise the log2 transformed data.

exampleSummaryData.norm <- normaliseIllumina(exampleSummaryData.log2, 
method="quantile", transform="none")

An alternative approach is to combine normal-exponential background correction with quantile normalisation as suggested in the limma package. However, this requires data that have not been log-transformed. Note that the control probes are removed from the output object

exampleSummaryData.norm2 <- normaliseIllumina(channel(exampleSummaryData, "G.ul"), 
  method="neqc", transform="none")

Filtering

Filtering non-responding probes from further analysis can improve the power to detect differential expression. One way of achieving this is to remove probes whose probe sequence has undesirable properties. Four basic annotation quality categories (Perfect',Good’, Bad' andNo match’) are defined and have been shown to correlate with expression level and measures of differential expression. We recommend removing probes assigned a Bad' orNo match’ quality score after normalization. This approach is similar to the common practice of removing lowly-expressed probes, but with the additional benefit of discarding probes with a high expression level caused by non-specific hybridization.

library(illuminaHumanv3.db)

ids <- as.character(featureNames(exampleSummaryData.norm))

qual <- unlist(mget(ids, illuminaHumanv3PROBEQUALITY, ifnotfound=NA))

table(qual)
## qual
##         Bad        Good     Good***    Good****    No match     Perfect 
##       13475         925         148         358        1739       24687 
##  Perfect*** Perfect**** 
##        6269        1975
rem <- qual == "No match" | qual == "Bad" | is.na(qual)

exampleSummaryData.filt <- exampleSummaryData.norm[!rem,]

dim(exampleSummaryData.filt)
## Features  Samples Channels 
##    34362       12        1

Differential expression

The differential expression methods available in the limma package can be used to identify differentially expressed genes. The functions lmFit and eBayes can be applied to the normalised data. In the example below, we set up a design matrix for the example experiment and fit a linear model to summaries the data from the UHRR and Brain replicates to give one value per condition. We then define contrasts comparing the Brain sample to the UHRR and calculate moderated t-statistics with empirical Bayes shrinkage of the sample variances. In this particular experiment, the Brain and UHRR samples are very different and we would expect to see many differentially expressed genes.\

Empirical array quality weights can be used to measure the relative reliability of each array. A variance is estimated for each array by the {arrayWeights function which measures how well the expression values from each array follow the linear model. These variances are converted to relative weights which can then be used in the linear model to down-weight observations from less reliable arrays which improves power to detect differential expression. You should notice that some arrays have very low weight consistent with their poor QC.

We then define a contrast comparing UHRR to Brain Reference and calculate moderated t-statistics with empirical Bayes’ shrinkage of the sample variances.

rna <- factor(pData(exampleSummaryData)[,"SampleFac"])

design <- model.matrix(~0+rna)
colnames(design) <- levels(rna)
aw <- arrayWeights(exprs(exampleSummaryData.filt), design)
aw
fit <- lmFit(exprs(exampleSummaryData.filt), design, weights=aw)
contrasts <- makeContrasts(UHRR-Brain, levels=design)
contr.fit <- eBayes(contrasts.fit(fit, contrasts))
topTable(contr.fit, coef=1)

Automating the DE analysis

A convenience function has been created to automate the differential expression analysis and repeat the above steps. The requirements to the function are a normalised object and a SampleGroup. By default, a design matrix and contrast matrix are derived from the SampleGroup by considering all pairwise contrasts. The matrices used, along with the array weights are saved in the output and can be retrieved later.

limmaRes <- limmaDE(exampleSummaryData.filt, SampleGroup="SampleFac")
## Calculating array weights
## Array weights
limmaRes
## Results of limma analysis
## Design Matrix used...
##   Brain UHRR
## 1     0    1
## 2     0    1
## 3     0    1
## 4     0    1
## 5     0    1
## 6     0    1
## .....
## 
## Array Weights....
## Contrast Matrix used...
##        Contrasts
## Levels  Brain-UHRR
##   Brain          1
##   UHRR          -1
## Array Weights....
## 2.098 2.544 ... 2.07 1.282
## Top Table
## Top 10 probes for contrast Brain-UHRR 
##                 Row.names ArrayAddressID   IlluminaID  Status SYMBOL
## ILMN_1651358 ILMN_1651358        4830541 ILMN_1651358 regular   HBE1
## ILMN_1796678 ILMN_1796678         450537 ILMN_1796678 regular   HBG1
## ILMN_1713458 ILMN_1713458        6980192 ILMN_1713458 regular    HBZ
## ILMN_1783832 ILMN_1783832        7570189 ILMN_1783832 regular  GAGE6
##              PROBEQUALITY     CODINGZONE
## ILMN_1651358      Perfect Transcriptomic
## ILMN_1796678      Perfect Transcriptomic
## ILMN_1713458      Perfect Transcriptomic
## ILMN_1783832     Good**** Transcriptomic
##                                                   PROBESEQUENCE
## ILMN_1651358 ATTCTGGCTACTCACTTTGGCAAGGAGTTCACCCCTGAAGTGCAGGCTGC
## ILMN_1796678 AGAATTCACCCCTGAGGTGCAGGCTTCCTGGCAGAAGATGGTGACTGCAG
## ILMN_1713458 GTCCTGGAGGTTCCCCAGCCCCACTTACCGCGTAATGCGCCAATAAACCA
## ILMN_1783832 CCACAGACTGGGTGTGAGTGTGAAGATGGTCCTGATGGGCAGGAGGTGGA
##                       GENOMICLOCATION     LogFC  LogOdds       pvalue
## ILMN_1651358  chr11:5289754:5289803:- -7.344650 67.23219 4.287612e-34
## ILMN_1796678  chr11:5269621:5269670:- -7.320087 66.69410 7.898248e-34
## ILMN_1713458    chr16:204444:204493:+ -6.419908 64.52756 8.879586e-33
## ILMN_1783832 chrX:49330136:49330185:+ -5.973710 64.27316 1.175228e-32
## 
## 
## Significant probes with adjusted p-value < 0.05
## Direction
##    -1     0     1 
##  4720 25619  4023
DesignMatrix(limmaRes)
##    Brain UHRR
## 1      0    1
## 2      0    1
## 3      0    1
## 4      0    1
## 5      0    1
## 6      0    1
## 7      1    0
## 8      1    0
## 9      1    0
## 10     1    0
## 11     1    0
## 12     1    0
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$`as.factor(SampleGroup)`
## [1] "contr.treatment"
ContrastMatrix(limmaRes)
##        Contrasts
## Levels  Brain-UHRR
##   Brain          1
##   UHRR          -1
ArrayWeights(limmaRes)
##          1          2          3          4          5          6          7 
## 2.09833261 2.54416003 1.46675905 1.76081750 2.15657466 1.85199681 0.01204889 
##          8          9         10         11         12 
## 0.11056049 2.50695037 2.04941743 2.06972327 1.28194050
plot(limmaRes)

Output as GRanges

To assist with meta-analysis, and integration with other genomic data-types, it is possible to export the normalised values as a GRanges object. Therefore it is easy to perform overlaps, counts etc with other data using the GenomicRanges and GenomicFeatures packages. In order for the ranges to be constructed, the genomic locations of each probe have to be obtained from the appropriate annotation package illuminaHumanv3.db in this example). Provided that this package has been installed, the mapping should occur automatically. The expression values are stored in the GRanges object along with thefeatureData`.

gr <- as(exampleSummaryData.filt[,1:5], "GRanges")
gr
## GRanges object with 37013 ranges and 14 metadata columns:
##                            seqnames        ranges strand |    Row.names
##                               <Rle>     <IRanges>  <Rle> |       <AsIs>
##   ILMN_1776601                 chr1   69476-69525      + | ILMN_1776601
##   ILMN_1665540                 chr1 324468-324517      + | ILMN_1665540
##   ILMN_1776483                 chr1 324469-324518      + | ILMN_1776483
##   ILMN_1682912                 chr1 324673-324722      + | ILMN_1682912
##   ILMN_1889155                 chr1 759949-759998      + | ILMN_1889155
##            ...                  ...           ...    ... .          ...
##   ILMN_1691189       chrUn_gl000211   23460-23509      - | ILMN_1691189
##   ILMN_1722620 chr4_gl000193_random   75089-75138      - | ILMN_1722620
##   ILMN_1821517                 chrM     8249-8287      + | ILMN_1821517
##   ILMN_1660133 chr7_gl000195_random 165531-165580      + | ILMN_1660133
##   ILMN_1684166 chr4_gl000194_random   55310-55359      - | ILMN_1684166
##                ArrayAddressID   IlluminaID   Status      SYMBOL PROBEQUALITY
##                     <numeric>     <factor> <factor> <character>  <character>
##   ILMN_1776601        3610128 ILMN_1776601  regular       OR4F5      Perfect
##   ILMN_1665540        2570482 ILMN_1665540  regular        <NA>  Perfect****
##   ILMN_1776483        6290672 ILMN_1776483  regular        <NA>  Perfect****
##   ILMN_1682912        4060014 ILMN_1682912  regular        <NA>  Perfect****
##   ILMN_1889155        1780768 ILMN_1889155  regular        <NA>   Perfect***
##            ...            ...          ...      ...         ...          ...
##   ILMN_1691189        5310750 ILMN_1691189  regular        <NA>  Perfect****
##   ILMN_1722620        5560181 ILMN_1722620  regular   LINC01667  Perfect****
##   ILMN_1821517        6550386 ILMN_1821517  regular        <NA>         Good
##   ILMN_1660133        7510136 ILMN_1660133  regular        <NA>   Perfect***
##   ILMN_1684166        7650241 ILMN_1684166  regular       MAFIP      Perfect
##                     CODINGZONE          PROBESEQUENCE        GENOMICLOCATION
##                    <character>            <character>            <character>
##   ILMN_1776601  Transcriptomic TGTGTGGCAACGCATGTGTC..     chr1:69476:69525:+
##   ILMN_1665540  Transcriptomic CAGAACTTTCTCCAGTCAGC..   chr1:324468:324517:+
##   ILMN_1776483  Transcriptomic AGAACTTTCTCCAGTCAGCC..   chr1:324469:324518:+
##   ILMN_1682912  Transcriptomic GTCGACCTCACCAGGCCCAG..   chr1:324673:324722:+
##   ILMN_1889155 Transcriptomic? GCCCCAAGTGGAGGAACCCT..   chr1:759949:759998:+
##            ...             ...                    ...                    ...
##   ILMN_1691189  Transcriptomic GCCTGTCTTCAAAACTAAGA.. chrUn_gl000211:23460..
##   ILMN_1722620  Transcriptomic CCAGCATCTCCTGGACAGTC.. chr4_gl000193_random..
##   ILMN_1821517  Transcriptomic GCAGGGCCCGTATTTACCCT..       chrM:8249:8287:+
##   ILMN_1660133 Transcriptomic? GCAGACAGCCTGAGGAAGAT.. chr7_gl000195_random..
##   ILMN_1684166  Transcriptomic TTTCTCCTCTGTCCCACTTA.. chr4_gl000194_random..
##                X4613710017_B X4613710052_B X4613710054_B X4616443079_B
##                    <numeric>     <numeric>     <numeric>     <numeric>
##   ILMN_1776601       5.65762       5.32280       5.42786       5.23788
##   ILMN_1665540       6.41036       6.19793       6.27768       6.22283
##   ILMN_1776483       6.65937       6.23047       6.31748       6.08682
##   ILMN_1682912       6.05543       6.03521       5.99879       6.01619
##   ILMN_1889155       5.51711       5.53557       5.51818       5.47225
##            ...           ...           ...           ...           ...
##   ILMN_1691189       5.26922       5.11971       5.30667       5.33993
##   ILMN_1722620       5.86809       5.88124       5.69247       5.74034
##   ILMN_1821517      13.23931      13.53013      13.60952      13.09584
##   ILMN_1660133       5.75937       5.93725       5.89674       5.52979
##   ILMN_1684166       5.44291       5.34176       5.25382       5.78671
##                X4616443093_B
##                    <numeric>
##   ILMN_1776601       5.36293
##   ILMN_1665540       6.30330
##   ILMN_1776483       6.26472
##   ILMN_1682912       6.01619
##   ILMN_1889155       5.61177
##            ...           ...
##   ILMN_1691189       5.46230
##   ILMN_1722620       5.95092
##   ILMN_1821517      13.19210
##   ILMN_1660133       5.91776
##   ILMN_1684166       5.71984
##   -------
##   seqinfo: 48 sequences from an unspecified genome; no seqlengths

The limma analysis results can also be exported as a GRanges object for downstream analysis. The elementMetadata of the output object is set to the statistics from the limma analysis.

lgr <- as(limmaRes, "GRanges")
lgr
## GRangesList object of length 1:
## $`Brain-UHRR`
## GRanges object with 37013 ranges and 3 metadata columns:
##                            seqnames        ranges strand |      LogFC   LogOdds
##                               <Rle>     <IRanges>  <Rle> |  <numeric> <numeric>
##   ILMN_1776601                 chr1   69476-69525      + | -0.0429096  -8.17005
##   ILMN_1665540                 chr1 324468-324517      + | -0.3102748  -3.16422
##   ILMN_1776483                 chr1 324469-324518      + | -0.5445102   2.91450
##   ILMN_1682912                 chr1 324673-324722      + | -0.2790607  -3.04031
##   ILMN_1889155                 chr1 759949-759998      + |  0.0095533  -8.28264
##            ...                  ...           ...    ... .        ...       ...
##   ILMN_1691189       chrUn_gl000211   23460-23509      - |  0.1446237 -7.682845
##   ILMN_1722620 chr4_gl000193_random   75089-75138      - | -0.4055233  0.282373
##   ILMN_1821517                 chrM     8249-8287      + |  0.0149122 -8.276389
##   ILMN_1660133 chr7_gl000195_random 165531-165580      + | -0.3728807 -1.675010
##   ILMN_1684166 chr4_gl000194_random   55310-55359      - |  0.0003597 -8.289516
##                     PValue
##                  <numeric>
##   ILMN_1776601 6.34482e-01
##   ILMN_1665540 1.83906e-03
##   ILMN_1776483 4.06969e-06
##   ILMN_1682912 1.61866e-03
##   ILMN_1889155 9.09142e-01
##            ...         ...
##   ILMN_1691189 2.83997e-01
##   ILMN_1722620 5.58833e-05
##   ILMN_1821517 8.74748e-01
##   ILMN_1660133 4.01359e-04
##   ILMN_1684166 9.97108e-01
##   -------
##   seqinfo: 48 sequences from an unspecified genome; no seqlengths

The data can be manipulated according to the DE stats

lgr <- lgr[[1]]
lgr[order(lgr$LogOdds,decreasing=T)]
## GRanges object with 37013 ranges and 3 metadata columns:
##                seqnames            ranges strand |        LogFC   LogOdds
##                   <Rle>         <IRanges>  <Rle> |    <numeric> <numeric>
##   ILMN_1651358    chr11   5289754-5289803      - |     -7.34465   67.2322
##   ILMN_1796678    chr11   5269621-5269670      - |     -7.32009   66.6941
##   ILMN_1713458    chr16     204444-204493      + |     -6.41991   64.5276
##   ILMN_1783832     chrX 49330136-49330185      + |     -5.97371   64.2732
##   ILMN_1782939     chr4 74285311-74285356      + |     -6.82192   63.9110
##            ...      ...               ...    ... .          ...       ...
##   ILMN_1700728     chr2 27666372-27666399      + |  1.76813e-05  -8.28952
##   ILMN_1700728     chr2 27666816-27666835      + |  1.76813e-05  -8.28952
##   ILMN_1774781    chr19 45015129-45015178      - | -1.25897e-05  -8.28952
##   ILMN_2152095     chr5 31401519-31401568      - | -6.37711e-06  -8.28952
##   ILMN_1654074    chr17 48165652-48165701      + |  4.70621e-08  -8.28952
##                     PValue
##                  <numeric>
##   ILMN_1651358 4.28761e-34
##   ILMN_1796678 7.89825e-34
##   ILMN_1713458 8.87959e-33
##   ILMN_1783832 1.17523e-32
##   ILMN_1782939 1.74926e-32
##            ...         ...
##   ILMN_1700728    0.999847
##   ILMN_1700728    0.999847
##   ILMN_1774781    0.999883
##   ILMN_2152095    0.999957
##   ILMN_1654074    1.000000
##   -------
##   seqinfo: 48 sequences from an unspecified genome; no seqlengths
lgr[p.adjust(lgr$PValue)<0.05]
## GRanges object with 9452 ranges and 3 metadata columns:
##                     seqnames          ranges strand |     LogFC   LogOdds
##                        <Rle>       <IRanges>  <Rle> | <numeric> <numeric>
##   ILMN_1709067          chr1   879456-879505      + | -0.627372   7.43633
##   ILMN_1705602          chr1   900738-900787      + | -0.611749   5.61562
##   ILMN_1770454          chr1   991196-991245      + | -1.014300  18.40797
##   ILMN_1780315          chr1 1246734-1246783      + | -0.720793   8.34688
##   ILMN_1773026          chr1 1372636-1372685      + |  0.669350  10.36332
##            ...           ...             ...    ... .       ...       ...
##   ILMN_2398587 chr6_apd_hap1 1269579-1269628      + |  -1.25312   19.7559
##   ILMN_1692486 chr6_apd_hap1 1270027-1270031      + |  -1.58605   26.9093
##   ILMN_1692486 chr6_apd_hap1 1270238-1270282      + |  -1.58605   26.9093
##   ILMN_1708006 chr6_apd_hap1 2793475-2793524      + |  -2.07723   33.1812
##   ILMN_2070300 chr6_apd_hap1 3079946-3079995      - |  -1.43871   25.3520
##                     PValue
##                  <numeric>
##   ILMN_1709067 4.73562e-08
##   ILMN_1705602 2.83116e-07
##   ILMN_1770454 1.06095e-12
##   ILMN_1780315 1.94013e-08
##   ILMN_1773026 2.69858e-09
##            ...         ...
##   ILMN_2398587 2.85759e-13
##   ILMN_1692486 2.70501e-16
##   ILMN_1692486 2.70501e-16
##   ILMN_1708006 5.95089e-19
##   ILMN_2070300 1.23259e-15
##   -------
##   seqinfo: 48 sequences from an unspecified genome; no seqlengths

We can do overlaps with other objects

library(GenomicRanges)
## Loading required package: GenomeInfoDb
  HBE1 <- GRanges("chr11", IRanges(5289580,5291373),strand="-")

  lgr[lgr %over% HBE1]
## GRanges object with 1 range and 3 metadata columns:
##                seqnames          ranges strand |     LogFC   LogOdds
##                   <Rle>       <IRanges>  <Rle> | <numeric> <numeric>
##   ILMN_1651358    chr11 5289754-5289803      - |  -7.34465   67.2322
##                     PValue
##                  <numeric>
##   ILMN_1651358 4.28761e-34
##   -------
##   seqinfo: 48 sequences from an unspecified genome; no seqlengths

Visualisation options

Having converted the DE results into a common format such as GRanges allows access to common routines, such as those provided by ggbio. For example, it is often useful to know where exactly the illumina probes are located with respect to the gene.

library(ggbio)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
tx <- TxDb.Hsapiens.UCSC.hg19.knownGene
p1 <- autoplot(tx, which=HBE1)
p2 <-   autoplot(lgr[lgr %over% HBE1])
tracks(p1,p2)
id <- plotIdeogram(genome="hg19", subchr="chr11")
tracks(id,p1,p2)

Genome-wide plots are also available

plotGrandLinear(lgr, aes(y = LogFC))

Creating a GEO submission file

Most journals are now requiring that data are deposited in a public repository prior to publication of a manuscript. Formatting the microarray and associated metadata can be time-consuming, so we have provided a function to create a template for a GEO submission. GEO require particular meta data to be recorded regarding the experimental protocols. The output of the makeGEOSubmissionFiles includes a spreadsheet with the relevant fields that can be filled in manually. The normalised and raw data are written to tab-delimited files. By default, the annotation package associated with the data is consulted to determine which probes are exported. Any probes that are present in the data, but not in the annotation package are excluded from the submission file.

rawdata <- channel(exampleSummaryData, "G")
normdata <- normaliseIllumina(rawdata)

makeGEOSubmissionFiles(normdata,rawdata)

Alternatively, GEO’s official probe annotation files can be used to decide which probes to include in the submission. You will first have to download the appropriate file from the GEO website.

download.file(
"ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL6nnn/GPL6947/annot/GPL6947.annot.gz",
destfile="GPL6947.annot.gz"
)

makeGEOSubmissionFiles(normdata,rawdata,softTemplate="GPL6947.annot.gz")

Analysing data from GEO

beadarray now contains functionality that can assist in the analysis of data available in the GEO (Gene Expression Omnibus) repository. We can download such data using GEOquery:

library(GEOquery)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
url <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE33nnn/GSE33126/matrix/"
filenm <- "GSE33126_series_matrix.txt.gz"
if(!file.exists("GSE33126_series_matrix.txt.gz")) download.file(paste(url, filenm, sep=""), destfile=filenm)
gse <- getGEO(filename=filenm)
head(exprs(gse))
##               GSM820516  GSM820517  GSM820518  GSM820519  GSM820520  GSM820521
## ILMN_1343291 66665.3800 69404.6700 64128.0700 68943.9700 67827.2200 71775.3000
## ILMN_1343295 22040.1100 13046.3400 38678.9600 16641.8900 33719.8900 18933.2900
## ILMN_1651199   226.6081   205.4483   217.2475   229.0451   226.3029   203.8710
## ILMN_1651209   278.5710   253.7044   211.8002   278.0423   259.8059   265.1900
## ILMN_1651210   195.4914   195.9835   175.3356   193.9065   229.5674   164.0632
## ILMN_1651221   217.2764   206.0723   219.5992   205.0462   194.7481   233.9729
##               GSM820522  GSM820523  GSM820524  GSM820525  GSM820526  GSM820527
## ILMN_1343291 62245.5900 69713.7000 69509.2700 68244.5900 65427.4700 68436.5200
## ILMN_1343295 26170.0400  9906.9130 17166.5200 12428.9500 25297.5700 17535.6000
## ILMN_1651199   213.4431   210.4129   229.5394   212.7384   226.1345   232.2437
## ILMN_1651209   321.2587   273.4458   253.6032   310.1582   275.0126   274.9519
## ILMN_1651210   244.6696   190.9813   188.1039   199.3084   220.6229   213.3975
## ILMN_1651221   233.2414   222.1413   244.1758   209.8247   227.8108   255.9940
##               GSM820528  GSM820529  GSM820530  GSM820531  GSM820532  GSM820533
## ILMN_1343291 57608.6700 69959.7700 69509.2700 70063.7700 69647.1700 70332.3400
## ILMN_1343295 19749.1400 17854.2900 43670.6800 22849.0800 23725.6600 28747.0100
## ILMN_1651199   208.7316   229.2948   214.4033   216.6758   195.6539   252.1502
## ILMN_1651209   250.6420   255.8540   219.5752   292.4965   253.3126   237.9844
## ILMN_1651210   194.7746   173.7073   185.3380   174.6898   195.3534   191.9382
## ILMN_1651221   233.5495   194.4097   218.6604   271.4561   222.7748   241.0521

Now we convert this to an ExpressionSetIllumina; beadarray’s native class for dealing with summarised data. The annotation slot stored in the ExpressionSet is converted from a GEO identifier (e.g. GPL10558) to one recognised by beadarray (e.g. Humanv4). If no conversion is possible, the resulting object will have NULL for the annotation slot. If successful, you should notice that the object is automatically annotated against the latest available annotation package.

summaryData <- as(gse, "ExpressionSetIllumina")
summaryData
## ExpressionSetIllumina (storageMode: list)
## assayData: 48803 features, 18 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM820516 GSM820517 ... GSM820533 (18 total)
##   varLabels: title geo_accession ... tissue:ch1 (34 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48803
##     total)
##   fvarLabels: Row.names ID ... PROBESEQUENCE (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation: Humanv3 
## QC Information
##  Available Slots:  
##   QC Items: 
##   sampleNames:
head(fData(summaryData))
##                 Row.names           ID               nuID      Species Source
## ILMN_1343291 ILMN_1343291 ILMN_1343291 97viJ90hzUX_rZ.nvY Homo sapiens RefSeq
## ILMN_1343295 ILMN_1343295 ILMN_1343295 9fQSYRUddRf4Z6p6T4 Homo sapiens RefSeq
## ILMN_1651199 ILMN_1651199 ILMN_1651199 6OYpVKvaVZJlni1ChY Homo sapiens RefSeq
## ILMN_1651209 ILMN_1651209 ILMN_1651209 60abGV06oA3Va4f0rU Homo sapiens RefSeq
## ILMN_1651210 ILMN_1651210 ILMN_1651210 o7oTiLy97.l5Guommw Homo sapiens RefSeq
## ILMN_1651221 ILMN_1651221 ILMN_1651221 EllV59GiXrVNBZYKng Homo sapiens RefSeq
##               Search_Key  Transcript ILMN_Gene Source_Reference_ID   RefSeq_ID
## ILMN_1343291 ILMN_137991   ILMN_5311    EEF1A1         NM_001402.5 NM_001402.5
## ILMN_1343295       GAPDH  ILMN_27206     GAPDH         NM_002046.3 NM_002046.3
## ILMN_1651199  ILMN_45326  ILMN_46046 LOC643334         XM_944551.1 XM_944551.1
## ILMN_1651209   ILMN_8692   ILMN_8692   SLC35E2         NM_182838.1 NM_182838.1
## ILMN_1651210 ILMN_138115 ILMN_138115    DUSP22         XM_941691.1 XM_941691.1
## ILMN_1651221  ILMN_33528  ILMN_33528 LOC642820         XM_926225.1 XM_926225.1
##              Unigene_ID Entrez_Gene_ID       GI   Accession    Symbol
## ILMN_1343291                      1915 83367078 NM_001402.5    EEF1A1
## ILMN_1343295                      2597 83641890 NM_002046.3     GAPDH
## ILMN_1651199                    643334 88959248 XM_944551.1 LOC643334
## ILMN_1651209                      9906 33469136 NM_182838.1   SLC35E2
## ILMN_1651210                     56940 89070645 XM_941691.1    DUSP22
## ILMN_1651221                    642820 89031535 XM_926225.1 LOC642820
##              Protein_Product Array_Address_Id Probe_Type Probe_Start
## ILMN_1343291     NP_001393.1          3450719          S        1294
## ILMN_1343295     NP_002037.2          4490161          S         937
## ILMN_1651199     XP_949644.1          1770632          S           1
## ILMN_1651209     NP_878258.1          2940221          S        1103
## ILMN_1651210     XP_946784.1           380154          I        2975
## ILMN_1651221     XP_931318.1          2490538          S          61
##                                                        SEQUENCE Chromosome
## ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG          6
## ILMN_1343295 CTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTG         12
## ILMN_1651199 ATGCGAGGCCCCAGGGTTCGGCCCCGCAGCGCCGCTGAGTCCAAGGACCG           
## ILMN_1651209 TCACGGCGTACGCCCTCATGGGGAAAATCTCCCCGGTGACTTTCAGGTCC          1
## ILMN_1651210 TGTGGACATGAGAGTTAGTTCTGTTTTGCCTGCACGGTGGGAGCGGCGTA           
## ILMN_1651221 GCCGCCCCCTGCTTCACGGAGCCTGGTCCCATCAACCGCCGAAGGGCTGA         10
##              Probe_Chr_Orientation                   Probe_Coordinates
## ILMN_1343291                     - 74284362-74284378:74284474-74284506
## ILMN_1343295                     +                     6517320-6517369
## ILMN_1651199                                                          
## ILMN_1651209                     -         940247-940251:942395-942439
## ILMN_1651210                                                          
## ILMN_1651221                     +                   83534280-83534329
##               Cytoband
## ILMN_1343291     6q13c
## ILMN_1343295 12p13.31d
## ILMN_1651199   2q37.3b
## ILMN_1651209  1p36.33a
## ILMN_1651210   6p25.3b
## ILMN_1651221          
##                                                                                   Definition
## ILMN_1343291 Homo sapiens eukaryotic translation elongation factor 1 alpha 1 (EEF1A1), mRNA.
## ILMN_1343295            Homo sapiens glyceraldehyde-3-phosphate dehydrogenase (GAPDH), mRNA.
## ILMN_1651199               PREDICTED: Homo sapiens hypothetical LOC643334 (LOC643334), mRNA.
## ILMN_1651209               Homo sapiens solute carrier family 35, member E2 (SLC35E2), mRNA.
## ILMN_1651210         PREDICTED: Homo sapiens dual specificity phosphatase 22 (DUSP22), mRNA.
## ILMN_1651221       PREDICTED: Homo sapiens hypothetical protein LOC642820 (LOC642820), mRNA.
##                                Ontology_Component
## ILMN_1343291 cytoplasm [goid 5737] [evidence NAS]
## ILMN_1343295 cytoplasm [goid 5737] [evidence NAS]
## ILMN_1651199                                     
## ILMN_1651209                                     
## ILMN_1651210                                     
## ILMN_1651221                                     
##                                                                                                                 Ontology_Process
## ILMN_1343291 protein biosynthesis [goid 6412] [evidence IEA]; translational elongation [goid 6414] [pmid 3570288] [evidence NAS]
## ILMN_1343295                                glucose metabolism [goid 6006] [evidence IEA]; glycolysis [goid 6096] [evidence NAS]
## ILMN_1651199                                                                                                                    
## ILMN_1651209                                                                                                                    
## ILMN_1651210                                                                                                                    
## ILMN_1651221                                                                                                                    
##                                                                                                                                                                                                                                                                                          Ontology_Function
## ILMN_1343291 GTP binding [goid 5525] [evidence IEA]; GTPase activity [goid 3924] [pmid 3570288] [evidence NAS]; nucleotide binding [goid 166] [evidence IEA]; protein binding [goid 5515] [pmid 15231747] [evidence IPI]; translation elongation factor activity [goid 3746] [pmid 3570288] [evidence NAS]
## ILMN_1343295                                                                                   oxidoreductase activity [goid 16491] [evidence IEA]; NAD binding [goid 51287] [evidence IEA]; glyceraldehyde-3-phosphate dehydrogenase (phosphorylating) activity [goid 4365] [pmid 3170585] [evidence NAS]
## ILMN_1651199                                                                                                                                                                                                                                                                                              
## ILMN_1651209                                                                                                                                                                                                                                                                                              
## ILMN_1651210                                                                                                                                                                                                                                                                                              
## ILMN_1651221                                                                                                                                                                                                                                                                                              
##                                                                                                                                  Synonyms
## ILMN_1343291 EEF1A; FLJ25721; CCS-3; PTI1; CCS3; MGC102687; MGC16224; EF-Tu; eEF1A-1; EEF-1; MGC131894; HNGC:16303; GRAF-1EF; LENG7; EF1A
## ILMN_1343295                                                                                                         G3PD; GAPD; MGC88685
## ILMN_1651199                                                                                                                             
## ILMN_1651209                                      MGC117254; MGC138494; DKFZp686M0869; FLJ34996; FLJ44537; MGC126715; MGC104754; KIAA0447
## ILMN_1651210                                                                                                                             
## ILMN_1651221                                                                                                                             
##                                                                            Obsolete_Probe_Id
## ILMN_1343291 PTI1; eEF1A-1; EEF1A; MGC16224; EF-Tu; EEF-1; HNGC:16303; GRAF-1EF; LENG7; EF1A
## ILMN_1343295                                                            G3PD; GAPD; MGC88685
## ILMN_1651199                                                                                
## ILMN_1651209             MGC117254; MGC138494; DKFZp686M0869; MGC126715; MGC104754; KIAA0447
## ILMN_1651210                                                                                
## ILMN_1651221                                                                                
##                   GB_ACC   SYMBOL PROBEQUALITY     CODINGZONE
## ILMN_1343291 NM_001402.5   EEF1A1      Perfect Transcriptomic
## ILMN_1343295 NM_002046.3    GAPDH      Perfect Transcriptomic
## ILMN_1651199 XM_944551.1     <NA>          Bad     Intergenic
## ILMN_1651209 NM_182838.1 SLC35E2A      Perfect Transcriptomic
## ILMN_1651210 XM_941691.1     <NA>          Bad     Intergenic
## ILMN_1651221 XM_926225.1     CLXN          Bad Transcriptomic
##                                                   PROBESEQUENCE
## ILMN_1343291 TGTGTTGAGAGCTTCTCAGACTATCCACCTTTGGGTCGCTTTGCTGTTCG
## ILMN_1343295 CTTCAACAGCGACACCCACTCCTCCACCTTTGACGCTGGGGCTGGCATTG
## ILMN_1651199 ATGCGAGGCCCCAGGGTTCGGCCCCGCAGCGCCGCTGAGTCCAAGGACCG
## ILMN_1651209 TCACGGCGTACGCCCTCATGGGGAAAATCTCCCCGGTGACTTTCAGGTCC
## ILMN_1651210 TGTGGACATGAGAGTTAGTTCTGTTTTGCCTGCACGGTGGGAGCGGCGTA
## ILMN_1651221 GCCGCCCCCTGCTTCACGGAGCCTGGTCCCATCAACCGCCGAAGGGCTGA

As we have annotated using the latest packages, we have imported the probe quality scores. We can calculate Detection scores by using the ‘No match’ probes as a reference; useful as data in repositories rarely export these data

fData(summaryData)$Status <- 
  ifelse(fData(summaryData)$PROBEQUALITY=="No match","negative","regular" )

Detection(summaryData) <- calculateDetection(summaryData, 
                              status=fData(summaryData)$Status)
##   |                                                                              |                                                                      |   0%  |                                                                              |====                                                                  |   6%  |                                                                              |========                                                              |  12%  |                                                                              |============                                                          |  18%  |                                                                              |================                                                      |  24%  |                                                                              |=====================                                                 |  29%  |                                                                              |=========================                                             |  35%  |                                                                              |=============================                                         |  41%  |                                                                              |=================================                                     |  47%  |                                                                              |=====================================                                 |  53%  |                                                                              |=========================================                             |  59%  |                                                                              |=============================================                         |  65%  |                                                                              |=================================================                     |  71%  |                                                                              |======================================================                |  76%  |                                                                              |==========================================================            |  82%  |                                                                              |==============================================================        |  88%  |                                                                              |==================================================================    |  94%  |                                                                              |======================================================================| 100%

The ‘neqc’ normalisation method from limma can also be used now.

summaryData.norm <- normaliseIllumina(summaryData,method="neqc", 
    status=fData(summaryData)$Status)
boxplot(summaryData.norm)

We can do differential expression if we know the column in the that contains sample group information

limmaResults <- limmaDE(summaryData.norm, "source_name_ch1")
## Calculating array weights
## Array weights
limmaResults
## Results of limma analysis
## Design Matrix used...
##   normal tumor
## 1      0     1
## 2      1     0
## 3      0     1
## 4      1     0
## 5      0     1
## 6      1     0
## .....
## 
## Array Weights....
## Contrast Matrix used...
##         Contrasts
## Levels   normal-tumor
##   normal            1
##   tumor            -1
## Array Weights....
## 1.091 0.759 ... 0.791 1.046
## Top Table
## Top 10 probes for contrast normal-tumor 
##                 Row.names           ID               nuID      Species Source
## ILMN_1704294 ILMN_1704294 ILMN_1704294 9epepee4eFxLof3d6A Homo sapiens RefSeq
## ILMN_1813704 ILMN_1813704 ILMN_1813704 NkGdd4Dn7f.3vlgMno Homo sapiens RefSeq
## ILMN_1724686 ILMN_1724686 ILMN_1724686 Zuc3vS45XSJ357yekk Homo sapiens RefSeq
## ILMN_1811598 ILMN_1811598 ILMN_1811598 Kx7uN9LC3dKd3ifvTU Homo sapiens RefSeq
##              Search_Key  Transcript ILMN_Gene Source_Reference_ID   RefSeq_ID
## ILMN_1704294 ILMN_12740  ILMN_12740      CDH3         NM_001793.3 NM_001793.3
## ILMN_1813704 ILMN_10606  ILMN_10606  KIAA1199         NM_018689.1 NM_018689.1
## ILMN_1724686 ILMN_24855 ILMN_177119     CLDN1         NM_021101.3 NM_021101.3
## ILMN_1811598 ILMN_15844  ILMN_15844     ADH1B         NM_000668.3 NM_000668.3
##              Unigene_ID Entrez_Gene_ID       GI   Accession   Symbol
## ILMN_1704294                      1001 45269142 NM_001793.3     CDH3
## ILMN_1813704                     57214 38638697 NM_018689.1 KIAA1199
## ILMN_1724686                      9076 21536297 NM_021101.3    CLDN1
## ILMN_1811598                       125 34577060 NM_000668.3    ADH1B
##              Protein_Product Array_Address_Id Probe_Type Probe_Start
## ILMN_1704294     NP_001784.2          7610338          S        3291
## ILMN_1813704     NP_061159.1           110181          S        6882
## ILMN_1724686     NP_066924.1          5960296          S        2967
## ILMN_1811598     NP_000659.2          2490612          S        2535
##                                                        SEQUENCE Chromosome
## ILMN_1704294 CTGGGCCTGGGCCTGCTGTGACTGACCTACAGTGGACTTTCTCTCTGGAA         16
## ILMN_1813704 GCAACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGG         15
## ILMN_1724686 GTGCTATCTGTTCAGTGATGCCCTCAGAGCTCTTGCTGTTAGCTGGCAGC          3
## ILMN_1811598 TACTGTGTGATCTTCAGTAAGTCTCTCAGGCTCTCTGAGCTTGTTCATCC          4
##              Probe_Chr_Orientation   Probe_Coordinates Cytoband
## ILMN_1704294                     +   54605556-54605605 16q22.1c
## ILMN_1813704                     +   79030856-79030905 15q25.1b
## ILMN_1724686                     - 191506599-191506648    3q28c
## ILMN_1811598                     -   95965411-95965460    4q23b
##                                                                                    Definition
## ILMN_1704294            Homo sapiens cadherin 3, type 1, P-cadherin (placental) (CDH3), mRNA.
## ILMN_1813704                                          Homo sapiens KIAA1199 (KIAA1199), mRNA.
## ILMN_1724686                                            Homo sapiens claudin 1 (CLDN1), mRNA.
## ILMN_1811598 Homo sapiens alcohol dehydrogenase IB (class I), beta polypeptide (ADH1B), mRNA.
##                                                                                                                                                  Ontology_Component
## ILMN_1704294                                                                 membrane [goid 16020] [evidence IEA]; integral to membrane [goid 16021] [evidence IEA]
## ILMN_1813704                                                                                                                                                       
## ILMN_1724686 integral to plasma membrane [goid 5887] [pmid 9647647] [evidence TAS]; membrane [goid 16020] [evidence IEA]; tight junction [goid 5923] [evidence ISS]
## ILMN_1811598                                                                                                                                                       
##                                                                                                                                                        Ontology_Process
## ILMN_1704294 homophilic cell adhesion [goid 7156] [evidence IEA]; visual perception [goid 7601] [evidence IEA]; cell adhesion [goid 7155] [pmid 2793940] [evidence TAS]
## ILMN_1813704                                                                                                                                                           
## ILMN_1724686                                calcium-independent cell-cell adhesion [goid 16338] [evidence ISS]; cell adhesion [goid 7155] [pmid 9647647] [evidence TAS]
## ILMN_1811598                                                                                                ethanol oxidation [goid 6069] [pmid 2398055] [evidence TAS]
##                                                                                                                                                                                                                                                               Ontology_Function
## ILMN_1704294                                                                                                                                                                         calcium ion binding [goid 5509] [evidence IEA]; protein binding [goid 5515] [evidence IEA]
## ILMN_1813704                                                                                                                                                                                                                                                                   
## ILMN_1724686                                                                                                                                                                                                            structural molecule activity [goid 5198] [evidence IEA]
## ILMN_1811598 metal ion binding [goid 46872] [evidence IEA]; zinc ion binding [goid 8270] [pmid 9982] [evidence IDA]; electron carrier activity [goid 9055] [pmid 2398055] [evidence TAS]; alcohol dehydrogenase activity, zinc-dependent [goid 4024] [pmid 9982] [evidence TAS]
##                         Synonyms   Obsolete_Probe_Id      GB_ACC SYMBOL
## ILMN_1704294    HJMD; PCAD; CDHP    HJMD; PCAD; CDHP NM_001793.3   CDH3
## ILMN_1813704       TMEM2L; CCSP1       TMEM2L; CCSP1 NM_018689.1  CEMIP
## ILMN_1724686 SEMP1; ILVASC; CLD1 SEMP1; ILVASC; CLD1 NM_021101.3  CLDN1
## ILMN_1811598                ADH2                ADH2 NM_000668.3  ADH1B
##              PROBEQUALITY     CODINGZONE
## ILMN_1704294      Perfect Transcriptomic
## ILMN_1813704      Perfect Transcriptomic
## ILMN_1724686      Perfect Transcriptomic
## ILMN_1811598          Bad Transcriptomic
##                                                   PROBESEQUENCE  Status
## ILMN_1704294 CTGGGCCTGGGCCTGCTGTGACTGACCTACAGTGGACTTTCTCTCTGGAA regular
## ILMN_1813704 GCAACGCTCCTCTGAAATGCTTGTCTTTTTTCTGTTGCCGAAATAGCTGG regular
## ILMN_1724686 GTGCTATCTGTTCAGTGATGCCCTCAGAGCTCTTGCTGTTAGCTGGCAGC regular
## ILMN_1811598 TACTGTGTGATCTTCAGTAAGTCTCTCAGGCTCTCTGAGCTTGTTCATCC regular
##                  LogFC  LogOdds       pvalue
## ILMN_1704294 -4.069076 20.13228 1.937124e-14
## ILMN_1813704 -5.674585 17.68160 8.574189e-13
## ILMN_1724686 -4.400890 16.07549 7.759525e-12
## ILMN_1811598  2.905752 14.58417 5.255321e-11
## 
## 
## Significant probes with adjusted p-value < 0.05
## Direction
##    -1     0     1 
##    44 47721    72

Reading bead summary data into beadarray

BeadStudio/GenomeStudio is Illumina’s proprietary software for analyzing data output by the scanning system (BeadScan/iScan). It contains different modules for analyzing data from different platforms. For further information on the software and how to export summarized data, refer to the user’s manual. In this section we consider how to read in and analyze output from the gene expression module of BeadStudio/GenomeStudio.

The example dataset used in this section consists of an experiment with one Human WG-6 version 2 BeadChip. These arrays were hybridized with the control RNA samples used in the MAQC project (3 replicates of UHRR and 3 replicates of Brain Reference RNA).

The non-normalized data for regular and control probes was output by BeadStudio/GenomeStudio.

The example BeadStudio output used in this section is available as a zip file that can be downloaded from [https://www.huber.embl.de/users/msmith/beadarray/AsuragenMAQC_BeadStudioOutput.zip] (https://www.huber.embl.de/users/msmith/beadarray/AsuragenMAQC_BeadStudioOutput.zip).

You will need to download and unzip the contents of this file to the current R working directory. Inside this zip file you will find several files including summarized, non-normalized data and a file containing control information.
We give a more detailed description of each of the particular files we will make use of below.

  • Sample probe profile (AsuragenMAQC-probe-raw.txt) - text file which contains the non-normalized summary values as output by BeadStudio. Inside the file is a data matrix with some 48,000 rows. In newer versions of the software, these data are preceded by several lines of header information. Each row is a different probe in the experiment and the columns give different measurements for the gene. For each array, we record the summarized expression level (AVG_Signal), standard error of the bead replicates (BEAD_STDERR), number of beads (Avg_NBEADS) and a detection p-value (Detection Pval) which estimates the probability of a gene being detected above the background level. When exporting this file from BeadStudio, the user is able to choose which columns to export.
  • Control probe profile (AsuragenMAQC-controls.txt) (recommended) - text file which contains the summarized data for each of the controls on each array, which may be useful for diagnostic and calibration purposes. Refer to the Illumina documentation for information on what each control measures.
  • targets file (optional) - text file created by the user specifying which sample is hybridized to each array. No such file is provided for this dataset, however we can extract sample annotation information from the column headings in the sample probe profile.

Files with normalized intensities (those with avg in the name), as well as files with one intensity value per gene (files with gene in the name) instead of separate intensities for different probes targeting the same transcript, are also available in this download.
We recommend users work with the non-normalized probe-specific data in their analysis where possible. Illumina’s background correction step, which subtracts the intensities of the negative control probes from the intensities of the regular probes, should also be avoided.

library(beadarray)
dataFile = "AsuragenMAQC-probe-raw.txt"
qcFile = "AsuragenMAQC-controls.txt"
BSData = readBeadSummaryData(dataFile = dataFile,
qcFile = qcFile, controlID = "ProbeID",
skip = 0, qc.skip = 0, qc.columns = list(exprs = "AVG_Signal",
Detection = "Detection Pval"))

The arguments of readBeadSummaryData can be modified to suit data from versions 1, 2 or 3 of BeadStudio. The current default settings should work for version 3 output. Users may need to change the argument sep, which specifies if the dataFile is comma or tab delimited and the skip argument which specifies the number of lines of header information at the top of the file. Possible skip arguments of 0, 7 and 8 have been observed, depending on the version of BeadStudio or way in which the data was exported. The columns argument is used to specify which column headings to read from dataFile and store in various matrices. Note that the naming of the columns containing the standard errors changed between versions of BeadStudio (earlier versions used BEAD STDEV in place of BEAD STDERR - be sure to check that the columns argument is appropriate for your data). Equivalent arguments (qc.sep, qc.skip and qc.columns) are used to read the data from qcFile. See the help page (?readBeadSummaryData) for a complete description of each argument to the function.

Reading IDAT files

We can also read BeadArray data in the format produced directly by the scanner, the IDAT file. The example below uses the GEOquery to obtain the four IDAT files stored as supplementary information for GEO series GSE27073. In this case the stored files have been compressed using gzip and need to be decompressed before beadarray can read them. If you are using IDAT files as they come of the scanner this step will not be necessary.

library(beadarray)
library(GEOquery)
downloadDir <- tempdir()
getGEOSuppFiles("GSE27073", makeDirectory = FALSE, baseDir = downloadDir)
idatFiles <- list.files(path = downloadDir, pattern = ".idat.gz", full.names=TRUE)
sapply(idatFiles, gunzip)
idatFiles <- list.files(path = downloadDir, pattern = ".idat", full.names=TRUE)
BSData <- readIdatFiles(idatFiles)

The output from readIdatFiles() is an object of class ExpressionSetIllumina, as described earlier.

Citing beadarray

If you use beadarray for the analysis or pre-processing of BeadArray data please cite:

Dunning MJ, Smith ML, Ritchie ME, Tavare S, beadarray: R classes and methods for Illumina bead-based data, Bioinformatics, 23(16):2183-2184

Asking for help on beadarray

Wherever possible, questions about beadarray should be sent to the Bioconductor support forum. This way, all problems and solutions will be kept in a searchable archive. When posting to this mailing list, please first consult the posting guide. In particular, state the version of beadarray and R that you are using and try to provide a reproducible example of your problem. This will help us to diagnose the problem.