| Title: | massiR: MicroArray Sample Sex Identifier |
|---|---|
| Description: | Predicts the sex of samples in gene expression microarray datasets |
| Authors: | Sam Buckberry |
| Maintainer: | Sam Buckberry <[email protected]> |
| License: | GPL-3 |
| Version: | 1.49.0 |
| Built: | 2026-06-04 07:04:55 UTC |
| Source: | https://github.com/bioc/massiR |
massi uses y chromosome probe information to cluster samples and predict the sex of each sample in gene expression microarray datasets.
| Package: | massi |
| Type: | Package |
| Version: | 0.99.0 |
| Date: | 2014-01-27 |
| License: | GPL-3 |
The massi analysis requires typical normalized sample/probe values produced by a microarray experiment. The massi_y function extracts the y chromosome probe information and calculates y chromosome probe variance to allow the user to select the most informative probes. Using the massi_select function, the user can select a probe variation threshold to reduce the number of probes used in the massi_cluster step. The massi_cluster function clusters samples into two clusters using the y chromosome probe values. Clustering is performed using the K-medoids method as implemented in the "fpc" package. There are two plotting functions, massi_y_plot and massi_cluster_plot, that allow the user to explore the data at various stages of the analysis. There is also a function, massi_dip, that can be used to test whether there may be a sample sex bias in the dataset.
Sam Buckberry
Maintainer: Sam Buckberry <[email protected]>
Christian Hennig (2013). fpc: Flexible procedures for clustering. R package version 2.1-6. http://CRAN.R-project.org/package=fpc
Martin Maechler (2013). diptest: Hartigan's dip test statistic for unimodality - corrected code. R package version 0.75-5. http://CRAN.R-project.org/package=diptest
Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2013). gplots: Various R programming tools for plotting data. R package version 2.12.1. http://CRAN.R-project.org/package=gplots
massi_y, massi_select, massi_cluster, massi_y_plot, massi_dip, massi_cluster_plot
    # load the test datasets
    data(massi.test.dataset, massi.test.probes)

    # use the massi_y function to calculate probe variation
    massi_y_out <- massi_y(expression_data=massi.test.dataset, y_probes=massi.test.probes)

    # plot probe variation to aid in deciding on the most informative subset of y chromosome probes
    massi_y_plot(massi_y_out)

    # extract the informative probes for clustering
    massi_select_out <- massi_select(massi.test.dataset, massi.test.probes, threshold=4)

    # cluster samples to predict the sex for each sample
    massi_cluster_out <- massi_cluster(massi_select_out)

    # get the predicted sex for each sample
    data.frame(massi_cluster_out[[2]])
The massi_cluster function predicts the sex of samples using k-medoids clustering.
    massi_cluster(y_data)
| Argument | Description |
|---|---|
| y_data | The y_data object is the data.frame returned from the |
This function clusters samples into two clusters using y chromosome probe values. K-medoids clustering is performed using the partitioning around medoids (pam) method implemented in the "fpc" package. The cluster with the highest probe values is determined to be the cluster of male samples, and the cluster with the lowest values is determined to be the cluster of female samples.
| Value | Description |
|---|---|
| cluster data | Contains all of the results from the k-medoids clustering. |
| massi.results | Contains the results for each sample, including sample id, predicted sex, sample z-score and mean probe expression. |
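The clustering and labelling logic described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `two_medoids` is a hypothetical helper that finds the best pair of medoids by exhaustive search (practical only for small sample counts), and the data are simulated.

```r
# Hypothetical helper (not part of massiR): exact 2-medoid clustering by
# trying every pair of samples as medoids and keeping the cheapest pair.
two_medoids <- function(x) {
  d <- as.matrix(dist(x))                 # Euclidean distances between rows (samples)
  pairs <- combn(nrow(x), 2)              # every candidate pair of medoids
  cost <- apply(pairs, 2, function(m) sum(pmin(d[, m[1]], d[, m[2]])))
  best <- pairs[, which.min(cost)]
  # assign each sample to its nearest medoid (cluster 1 or 2)
  cluster <- unname(ifelse(d[, best[1]] <= d[, best[2]], 1L, 2L))
  list(medoids = best, cluster = cluster)
}

set.seed(1)
# Simulated y chromosome probe values: 6 probes x 10 samples,
# where S1-S5 have high signal and S6-S10 have low signal
y_vals <- cbind(matrix(rnorm(30, mean = 10, sd = 0.3), nrow = 6),
                matrix(rnorm(30, mean = 6,  sd = 0.3), nrow = 6))
colnames(y_vals) <- paste0("S", 1:10)

fit <- two_medoids(t(y_vals))             # samples as rows
sample_means <- colMeans(y_vals)

# The cluster with the higher mean probe value is labelled male
cluster_means <- tapply(sample_means, fit$cluster, mean)
male_cluster <- as.integer(names(which.max(cluster_means)))
predicted_sex <- ifelse(fit$cluster == male_cluster, "male", "female")
names(predicted_sex) <- colnames(y_vals)
print(predicted_sex)
```

For real analyses, massi_cluster should be used directly; the sketch only shows why the cluster with the higher y chromosome probe values ends up labelled as male.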
Sam Buckberry
Christian Hennig (2013). fpc: Flexible procedures for clustering. R package version 2.1-6. http://CRAN.R-project.org/package=fpc
massi_y, massi_select, massi_y_plot, massi_dip, massi_cluster_plot
    # load the test dataset
    data(massi.test.dataset, massi.test.probes)

    # select the y chromosome probes using massi_select
    massi_select_out <- massi_select(massi.test.dataset, massi.test.probes)

    # cluster samples to predict sex using massi_cluster
    massi_cluster_out <- massi_cluster(massi_select_out)

    # get the results in a data.frame format
    data.frame(massi_cluster_out[[2]])
This function produces three figures in a new graphics device to enable the exploration of the massi_cluster and massi_select results.
    massi_cluster_plot(massi_select_data, massi_cluster_data)
| Argument | Description |
|---|---|
| massi_select_data | A data.frame containing the subset of y chromosome probe values for each sample. This is returned when running the massi_select function. |
| massi_cluster_data | This is the list returned from the massi_cluster function. |
The first figure is a heatmap depicting probe values for each sample. The second figure is a bar plot showing the mean probe expression and standard deviation for each sample. The bars are colored according to the predicted sex. The third figure is a principal component plot which represents the distances between samples, with each cluster highlighted with ellipses.
Returns three plots in a new graphics device.
Sam Buckberry
Gregory R. Warnes, Ben Bolker, Lodewijk Bonebakker, Robert Gentleman, Wolfgang Huber, Andy Liaw, Thomas Lumley, Martin Maechler, Arni Magnusson, Steffen Moeller, Marc Schwartz and Bill Venables (2013). gplots: Various R programming tools for plotting data. R package version 2.12.1. http://CRAN.R-project.org/package=gplots
    # load the test dataset
    data(massi.test.dataset, massi.test.probes)

    # select the y chromosome probes using massi_select
    massi_select_out <- massi_select(massi.test.dataset, massi.test.probes)

    # cluster samples to predict sex using massi_cluster
    massi_cluster_out <- massi_cluster(massi_select_out)

    # produce plots using massi_cluster_plot
    massi_cluster_plot(massi_select_out, massi_cluster_out)
The massi_dip function applies the dip test to the subset of y chromosome probe values returned from the massi_select function. This can be used to indicate whether there may be either a male or female bias in the dataset. This function returns a message indicating whether the dataset may have a sex bias. The results for massi_dip are not reliable for datasets with 10 or fewer samples.
    massi_dip(y_subset_values)
| Argument | Description |
|---|---|
| y_subset_values | A data.frame containing the subset of y chromosome probe values for each sample, as returned from the |
This function calculates z-scores for the y chromosome probe values returned from the massi_select function and then checks whether the average z-scores for each sample show a unimodal or multi-modal distribution by applying the dip test. If the proportion of male and female samples in the dataset is relatively balanced, the distribution of average z-scores should be bimodal. If the distribution looks unimodal, the dataset likely contains a high proportion of one sex. By testing with empirical datasets and randomly generating data subsets with different male/female proportions, guideline values were developed to provide an indication of whether there is a potential sex bias in the dataset. If the dip statistic is > 0.08, the dataset is highly likely to have proportions of male and female samples that will allow the massi_cluster function to predict the sex of samples with a high degree of accuracy. The results of this test should only be used as a guide and should be interpreted in light of the massi_cluster results. For more details see the massi package vignette.
This function returns a list containing
| Value | Description |
|---|---|
| dip.statistics | The results from the dip test |
| sample.mean.z.score | The mean of the probe z-scores for each sample used to calculate the dip statistics |
| density | Density values for the z-scores. Can be informative to plot these results |
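The z-score and dip-test steps described above can be illustrated with a short sketch on simulated data. Two things here are assumptions rather than facts about the package: that z-scores are computed per probe across samples, and the simulated values themselves. The call to `diptest::dip` only runs if the diptest package is installed.

```r
set.seed(42)
# Simulated y chromosome probe subset: 8 probes x 40 samples,
# 20 samples with high signal followed by 20 with low signal
y_subset <- cbind(matrix(rnorm(160, mean = 10, sd = 0.4), nrow = 8),
                  matrix(rnorm(160, mean = 6,  sd = 0.4), nrow = 8))
colnames(y_subset) <- paste0("S", seq_len(ncol(y_subset)))

# Assumption: each probe (row) is standardised across samples
z <- t(scale(t(y_subset)))

# Average z-score per sample; a balanced dataset should give two modes
sample_mean_z <- colMeans(z)
dens <- density(sample_mean_z)

# Hartigan's dip statistic, computed only when diptest is available
if (requireNamespace("diptest", quietly = TRUE)) {
  dip_stat <- diptest::dip(sample_mean_z)
  message("dip statistic: ", round(dip_stat, 3),
          if (dip_stat > 0.08) " (above the 0.08 guideline)"
          else " (at or below the 0.08 guideline)")
}
```

Plotting `dens` or a histogram of `sample_mean_z` gives the same kind of view as the density and z-score elements of the list returned by massi_dip.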
Sam Buckberry
Martin Maechler (2013). diptest: Hartigan's dip test statistic for unimodality - corrected code. R package version 0.75-5. http://CRAN.R-project.org/package=diptest
massi_y, massi_select, massi_cluster, massi_y_plot, massi_cluster_plot
    # load the test dataset
    data(massi.test.dataset, massi.test.probes)

    # select the y chromosome probes using massi_select
    massi_select_out <- massi_select(expression_data=massi.test.dataset, y_probes=massi.test.probes, threshold=4)

    # use the data.frame returned from massi_select to calculate dip statistics and z-scores
    massi_dip_out <- massi_dip(y_subset_values=massi_select_out)

    # view a density plot
    plot(massi_dip_out[[3]])

    # view a histogram of z-scores
    hist(x=massi_dip_out[[2]])
This function selects the y chromosome probe data for each sample.
    massi_select(expression_data, y_probes, threshold = 3)
| Argument | Description |
|---|---|
| expression_data | The expression_data argument contains normalized array expression data for all samples. This can be a data.frame with sample names as column names and probe IDs as row names. An ExpressionSet object can also be supplied. |
| y_probes | A data.frame of probe IDs in one column that match y chromosome genes for the array platform. massiR includes probes for several Illumina and Affymetrix platforms. Details on using these probes are included in the vignette and the |
| threshold | The threshold value corresponds to probe variation quantiles. This option allows the selection of the most variable probes. Deciding on a probe threshold value should be informed by viewing the plot generated by the |
A data.frame containing the subset of y chromosome probe values for each sample.
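To make the idea of a quantile-based threshold concrete, here is a minimal base R sketch. The helper `select_by_cv_quantile` is hypothetical, and the mapping from threshold values to quantiles (1 keeps all probes; 2, 3 and 4 keep probes whose CV exceeds the 25%, 50% and 75% quantile) is an assumption made for illustration, not something stated in this documentation.

```r
# Hypothetical helper (not part of massiR): keep probes whose coefficient
# of variation (CV) is above a quantile chosen by `threshold`.
select_by_cv_quantile <- function(y_values, threshold = 3) {
  stopifnot(threshold %in% 1:4)
  cv <- apply(y_values, 1, function(p) sd(p) / mean(p) * 100)
  # Assumed mapping: 1 = keep all, 2/3/4 = CV above the 25/50/75% quantile
  cutoff <- c(-Inf, quantile(cv, probs = c(0.25, 0.50, 0.75)))[threshold]
  y_values[cv > cutoff, , drop = FALSE]
}

set.seed(7)
# Simulated values: 8 probes x 12 samples, with noise growing by probe
y_values <- matrix(rnorm(8 * 12, mean = 8, sd = rep((1:8) / 10, times = 12)),
                   nrow = 8,
                   dimnames = list(paste0("probe", 1:8), paste0("S", 1:12)))

# Higher thresholds keep fewer, more variable probes
sapply(1:4, function(th) nrow(select_by_cv_quantile(y_values, th)))
```

The sketch shows the trade-off the threshold controls: a stricter setting discards low-variance probes, which carry little information for separating the two clusters.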
Sam Buckberry
massi_y, massi_cluster, massi_y_plot, massi_dip, massi_cluster_plot
    data(massi.test.dataset, massi.test.probes)
    massi_select(expression_data=massi.test.dataset, y_probes=massi.test.probes)
The massi_y function extracts the y chromosome probe values for each sample and calculates the coefficient of variation (CV) for each probe. The returned list contains CV values (%) for each probe and quantile data. The probe variation data can be visualized using the massi_y_plot function.
    massi_y(expression_data, y_probes)
| Argument | Description |
|---|---|
| expression_data | The expression_data argument contains normalized array expression data for all samples. This can be a data.frame with sample names as column names and probe IDs as row names. An ExpressionSet object can also be supplied. |
| y_probes | A data.frame of probe IDs in one column that match y chromosome genes for the array platform. massiR includes probes for several Illumina and Affymetrix platforms. Details on using these probes are included in the vignette and the |
The expression_data argument must be a data.frame with sample names as column names and probe IDs as row names. ExpressionSet objects can also be supplied; the expression data will be extracted from the ExpressionSet, and the returned list is the same as if the data had been entered in data.frame format.
The massi_y function returns a list containing probe id's, probe cv and quantiles.
| Value | Description |
|---|---|
| id | Probe IDs |
| cv | Probe CV values |
| quantiles | Quantiles of the CV values |
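As a rough illustration of the calculation described for massi_y, the sketch below extracts y chromosome probes from a toy data.frame and computes the CV (%) for each probe. It is not the package's code; the toy values are invented, and carrying the probe IDs in the row names of `y_probes` is an assumption about how the probe data.frame is laid out.

```r
# Toy expression data: 4 probes x 4 samples, probe IDs as row names
expression_data <- data.frame(
  S1 = c(8, 6, 7.0, 9), S2 = c(8, 10, 7.1, 9),
  S3 = c(8, 6, 6.9, 9), S4 = c(8, 10, 7.0, 9),
  row.names = c("probeA", "probeB", "probeC", "probeD")
)

# Assumption: y chromosome probe IDs are held in the row names of y_probes
y_probes <- data.frame(row.names = c("probeA", "probeB"))

# Keep only the rows whose probe ID matches a y chromosome probe
y_values <- expression_data[rownames(expression_data) %in% rownames(y_probes), ]

# Coefficient of variation (%) per probe: sd / mean * 100
cv <- apply(y_values, 1, function(p) sd(p) / mean(p) * 100)

# Same three elements as described above: ids, CV values, CV quantiles
result <- list(id = rownames(y_values), cv = cv, quantiles = quantile(cv))
print(result$cv)
```

Here `probeA` is constant across samples, so its CV is zero, while `probeB` alternates between two levels and is the kind of probe that would be informative for clustering.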
Sam Buckberry
massi_select, massi_cluster, massi_y_plot, massi_dip, massi_cluster_plot
    data(massi.test.dataset, massi.test.probes)
    massi_y(massi.test.dataset, massi.test.probes)
The massi_y_plot function plots the data output from the massi_y function.
    massi_y_plot(massi_y_out)
| Argument | Description |
|---|---|
| massi_y_out | This object is the list returned from |
This function produces a bar plot of the coefficient of variation (CV) for each probe in the dataset. This allows the user to identify the most variable probes that are likely to be the most informative in the sex prediction step. The 25%, 50% and 75% quantiles are represented as horizontal lines and represent the threshold values that can be specified for the massi_select function.
This function produces a bar plot in a new graphics device.
See vignette for more details.
Sam Buckberry
massi_y, massi_select, massi_cluster, massi_dip, massi_cluster_plot
    data(massi.test.dataset, massi.test.probes)
    massi_y_out <- massi_y(expression_data=massi.test.dataset, y_probes=massi.test.probes)
    massi_y_plot(massi_y_out)
Object of ExpressionSet class containing the massiR test expression data.
    data(massi.eset)
The format is:

    Formal class 'ExpressionSet' [package "Biobase"] with 7 slots
      ..@ experimentData :Formal class 'MIAME' [package "Biobase"] with 13 slots
      .. .. ..@ name : chr ""
      .. .. ..@ lab : chr ""
      .. .. ..@ contact : chr ""
      .. .. ..@ title : chr ""
      .. .. ..@ abstract : chr ""
      .. .. ..@ url : chr ""
      .. .. ..@ pubMedIds : chr ""
      .. .. ..@ samples : list()
      .. .. ..@ hybridizations : list()
      .. .. ..@ normControls : list()
      .. .. ..@ preprocessing : list()
      .. .. ..@ other : list()
      .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slots
      .. .. .. .. ..@ .Data:List of 2
      .. .. .. .. .. ..$ : int [1:3] 1 0 0
      .. .. .. .. .. ..$ : int [1:3] 1 1 0
      ..@ assayData :<environment: 0x7fd91fda9d50>
      ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
      .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
      .. .. .. ..$ labelDescription: chr(0)
      .. .. ..@ data :'data.frame': 60 obs. of 0 variables
      .. .. ..@ dimLabels : chr [1:2] "rowNames" "columnNames"
      .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slots
      .. .. .. .. ..@ .Data:List of 1
      .. .. .. .. .. ..$ : int [1:3] 1 1 0
      ..@ featureData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
      .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
      .. .. .. ..$ labelDescription: chr(0)
      .. .. ..@ data :'data.frame': 1026 obs. of 0 variables
      .. .. ..@ dimLabels : chr [1:2] "featureNames" "featureColumns"
      .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slots
      .. .. .. .. ..@ .Data:List of 1
      .. .. .. .. .. ..$ : int [1:3] 1 1 0
      ..@ annotation : chr(0)
      ..@ protocolData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
      .. .. ..@ varMetadata :'data.frame': 0 obs. of 1 variable:
      .. .. .. ..$ labelDescription: chr(0)
      .. .. ..@ data :'data.frame': 60 obs. of 0 variables
      .. .. ..@ dimLabels : chr [1:2] "sampleNames" "sampleColumns"
      .. .. ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slots
      .. .. .. .. ..@ .Data:List of 1
      .. .. .. .. .. ..$ : int [1:3] 1 1 0
      ..@ .__classVersion__:Formal class 'Versions' [package "Biobase"] with 1 slots
      .. .. ..@ .Data:List of 4
      .. .. .. ..$ : int [1:3] 3 0 2
      .. .. .. ..$ : int [1:3] 2 22 0
      .. .. .. ..$ : int [1:3] 1 3 0
      .. .. .. ..$ : int [1:3] 1 0 0
This data.frame object contains expression data for 60 samples and 1026 probes
    data(massi.test.dataset)
A data frame with 1026 observations on the following 60 variables.
S1: a numeric vector
S2: a numeric vector
S3: a numeric vector
S4: a numeric vector
S5: a numeric vector
S6: a numeric vector
S7: a numeric vector
S8: a numeric vector
S9: a numeric vector
S10: a numeric vector
S11: a numeric vector
S12: a numeric vector
S13: a numeric vector
S14: a numeric vector
S15: a numeric vector
S16: a numeric vector
S17: a numeric vector
S18: a numeric vector
S19: a numeric vector
S20: a numeric vector
S21: a numeric vector
S22: a numeric vector
S23: a numeric vector
S24: a numeric vector
S25: a numeric vector
S26: a numeric vector
S27: a numeric vector
S28: a numeric vector
S29: a numeric vector
S30: a numeric vector
S31: a numeric vector
S32: a numeric vector
S33: a numeric vector
S34: a numeric vector
S35: a numeric vector
S36: a numeric vector
S37: a numeric vector
S38: a numeric vector
S39: a numeric vector
S40: a numeric vector
S41: a numeric vector
S42: a numeric vector
S43: a numeric vector
S44: a numeric vector
S45: a numeric vector
S46: a numeric vector
S47: a numeric vector
S48: a numeric vector
S49: a numeric vector
S50: a numeric vector
S51: a numeric vector
S52: a numeric vector
S53: a numeric vector
S54: a numeric vector
S55: a numeric vector
S56: a numeric vector
S57: a numeric vector
S58: a numeric vector
S59: a numeric vector
S60: a numeric vector
This test dataset is in the data.frame format with sample names as column names and probe IDs as row names. This test dataset is a subset of an empirical dataset obtained from Illumina human-6 v2.0 expression beadchip arrays.
Data were adapted from dataset GSE25906 <http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25906>
    data(massi.test.dataset)
A data.frame containing 56 probes which correspond to y chromosome genes
    data(massi.test.probes)
data.frame
A data frame with 56 observations on the following 0 variables.
    data(massi.test.probes)
A list containing probes corresponding to y chromosome genes for Illumina and Affymetrix platforms. Each item in the list is a data.frame of y chromosome probes that can be used in the massi analysis. The names of each item in the list correspond to the ensembl biomart attribute names.
    data(y.probes)
The format is:

    List of 6
     $ illumina_humanwg_6_v1:'data.frame': 58 obs. of 0 variables
     $ illumina_humanwg_6_v2:'data.frame': 74 obs. of 0 variables
     $ illumina_humanwg_6_v1:'data.frame': 112 obs. of 0 variables
     $ illumina_humanht_12 :'data.frame': 112 obs. of 0 variables
     $ affy_hugene_1_0_st_v1:'data.frame': 138 obs. of 0 variables
     $ affy_hg_u133_plus_2 :'data.frame': 94 obs. of 0 variables
The y chromosome probes for each platform were downloaded from Ensembl biomart using the 'biomaRt' package. For more details on the methods of selecting the probes and how to obtain probes for other platforms, see the vignette for the massiR package.
Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Steffen Durinck, Paul T. Spellman, Ewan Birney and Wolfgang Huber, Nature Protocols 4, 1184-1191 (2009).
BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Steffen Durinck, Yves Moreau, Arek Kasprzyk, Sean Davis, Bart De Moor, Alvis Brazma and Wolfgang Huber, Bioinformatics 21, 3439-3440 (2005).
    # load the probes list
    data(y.probes)

    # look at the platform names
    names(y.probes)

    # extract the probes using the platform name
    probe.list <- y.probes[["illumina_humanwg_6_v2"]]