Principal Component Analysis (PCA) is a common exploratory tool used to evaluate large complex data sets. The resulting lower-dimensional representations are often valuable for pattern visualization, clustering, or classification of the data. However, PCA cannot be applied directly to many -omics data sets generated by newer technologies such as label-free mass spectrometry due to large numbers of non-random missing values. Here we present a sequential projection pursuit PCA (sppPCA) method for defining principal components in the presence of missing data. Our results demonstrate that this approach generates robust and informative low-dimensional data representations compared to commonly used imputation approaches.

%B Biotechniques %G eng %R 10.2144/000113978 %0 Journal Article %J Bioinformatics %D 2011 %T Improved quality control processing of peptide-centric LC-MS proteomics data. %A Matzke, Melissa M %A Waters, Katrina M %A Metz, Thomas O %A Jacobs, Jon M %A Sims, Amy C %A Baric, Ralph S %A Pounds, Joel G %A Webb-Robertson, Bobbie-Jo M %K Chromatography, Liquid %K Data Interpretation, Statistical %K Mass Spectrometry %K Peptides %K Proteome %K Proteomics %K Quality Control %K Software %X**MOTIVATION: **In the analysis of differential peptide peak intensities (i.e. abundance measures), LC-MS analyses with poor quality peptide abundance data can bias downstream statistical analyses and hence the biological interpretation for an otherwise high-quality dataset. Although considerable effort has been placed on assuring the quality of the peptide identification with respect to spectral processing, to date quality assessment of the subsequent peptide abundance data matrix has been limited to a subjective visual inspection of run-by-run correlation or individual peptide components. Identifying statistical outliers is a critical step in the processing of proteomics data as many of the downstream statistical analyses [e.g. analysis of variance (ANOVA)] rely upon accurate estimates of sample variance, and their results are influenced by extreme values.

**RESULTS: **We describe a novel multivariate statistical strategy for the identification of LC-MS runs with extreme peptide abundance distributions. Comparison with current method (run-by-run correlation) demonstrates a significantly better rate of identification of outlier runs by the multivariate strategy. Simulation studies also suggest that this strategy significantly outperforms correlation alone in the identification of statistically extreme liquid chromatography-mass spectrometry (LC-MS) runs.

**AVAILABILITY: **https://www.biopilot.org/docs/Software/RMD.php

**CONTACT: **bj@pnl.gov

**SUPPLEMENTARY INFORMATION: **Supplementary material is available at Bioinformatics online.

Liquid chromatography-mass spectrometry-based (LC-MS) proteomics uses peak intensities of proteolytic peptides to infer the differential abundance of peptides/proteins. However, substantial run-to-run variability in intensities and observations (presence/absence) of peptides makes data analysis quite challenging. The missing observations in LC-MS proteomics data are difficult to address with traditional imputation-based approaches because the mechanisms by which data are missing are unknown a priori. Data can be missing due to random mechanisms such as experimental error or nonrandom mechanisms such as a true biological effect. We present a statistical approach that uses a test of independence known as a G-test to test the null hypothesis of independence between the number of missing values across experimental groups. We pair the G-test results, evaluating independence of missing data (IMD) with an analysis of variance (ANOVA) that uses only means and variances computed from the observed data. Each peptide is therefore represented by two statistical confidence metrics, one for qualitative differential observation and one for quantitative differential intensity. We use three LC-MS data sets to demonstrate the robustness and sensitivity of the IMD-ANOVA approach.

%B J Proteome Res %G eng %R 10.1021/pr1005247