Published on March 5, 2014
Isoelectric point estimation using peptide descriptors and support vector machines
Y. Perez-Riverol, E. Audain, A. Millan, Y. Ramos, A. Sanchez, J. Vizcaíno, R. Wang, M. Müller, Y. Machado, L. Betancourt, L. González, G. Padrón, V. Besada
Center for Genetic Engineering and Biotechnology, P.O. Box 6162, Havana 10600, Cuba
Contact email: email@example.com

Feature Selection Protocol & Complexity reduction
Therapy with APL led significant reduction of TNFα

Introduction
IPG (Immobilized pH Gradient) based separations are frequently used as the first step in shotgun proteomics methods; they increase both the dynamic range and the resolution of peptide separation prior to LC-MS analysis. Experimental isoelectric point (pI) values can improve peptide identifications in conjunction with MS/MS information [1]. Our group has previously reported the possibility of theoretically identifying peptides and proteins based on different experimental properties [2]. Accurate estimation of the pI value from the amino acid sequence is therefore critical for these kinds of experiments. Currently, pI is commonly predicted using the charge-state model [3] and/or the cofactor algorithm [4]. However, neither of these methods can accurately calculate the pI value of basic peptides. In this work, we present a new approach that can significantly improve pI estimation by using Support Vector Machines (SVM), an experimental amino acid descriptor taken from the AAindex database [5], and the isoelectric point predicted by the charge-state model.

TLC analysis of in vitro fructan production using extracts from leaves, stem and seeds of transgenic line 3. A) Leaf extracts were subjected to IMAC. Proteins bound to Ni-NTA beads were eluted with 250 mM imidazole and incubated with 200 mM sucrose for 24 h at 30 °C. Lanes: 1, fructans from onion bulb; 2, nontransformed plant; 3, transgenic line 3.
B) Stem and seed extracts from transgenic line 3 were incubated with 200 mM sucrose for 24 h at 30 °C. Lanes: 1, substrate (control); 2, heat-inactivated stem extract; 3, stem extract; 4, seed extract; 5, marker.

A correlation matrix of the predictors shows the correlation between all calculated descriptors. A subset of the more problematic descriptors (correlation > 0.7) was then removed. Finally, the feature selection algorithm reduced the feature space from 555 to 44 descriptors.

Kernel function    Number of predictors    RMSD      R2
Polynomial         25                      0.3387    0.97
Linear             20                      0.3866    0.96
Exponential        2                       0.4       0.96
Radial             2                       0.32      0.98

Four different SVM kernel functions were evaluated, with automated sigma estimation, using the kernlab R package. The final model selects only two predictors: the isoelectric point predicted with the Bjellqvist algorithm and the experimental AAindex descriptor from Zimmerman.

Experimental data & processing protocol
Workflow (panels A to C): D. melanogaster Kc167 cells; Protein Extraction; Protein Digestion; OFFGEL (Off-gel Electrophoresis); LTQ-FT-ICR; X!Tandem & Peptide Prophet; Peptide Identifications; Filter by Probability (Probability > 0.97, Non-PTMs); High Probability Identifications; Descriptors Estimation & isoelectric point calculator.
[Figure: example MS/MS spectrum (4700 MS/MS, precursor 1570.7); axes: Mass (m/z), % Intensity.]

SVM algorithm vs Current algorithms
The theoretical and experimental values are more correlated in the 3.0–4.0 pH range.
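The kernel comparison above reports an RMSD and an R2 for each model. As a minimal sketch of how such figures could be computed from paired predicted and experimental pI values, the Python functions below may help. Two assumptions are made here that the poster does not confirm: R2 is taken to be the squared Pearson correlation, and the function names `rmsd` and `r_squared` are illustrative only (the original analysis used R).

```python
import math
from typing import Sequence


def rmsd(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square deviation between predicted and observed pI values."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared_error / len(predicted))


def r_squared(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Squared Pearson correlation (ASSUMPTION: this is the poster's R2)."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_p = sum(predicted) / len(predicted)
    mean_o = sum(observed) / len(observed)
    covariance = sum((p - mean_p) * (o - mean_o)
                     for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0.0 or var_o == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return covariance ** 2 / (var_p * var_o)
```

Both functions take plain sequences of floats, so they can be applied to the full data set or to the peptides of a single OFFGEL fraction.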
The average standard deviation over the first five fractions was 0.26, 0.23 and 0.25 for the SVM model, the charge-state algorithm and the adjacent algorithm, respectively. Over the last five fractions, the average standard deviation was 0.20, 0.52 and 0.32 for the SVM model, the charge-state algorithm and the adjacent algorithm, respectively.

[Figure: detection of possible false-positive peptide identifications; axis: Probability.]

References
[1] Cargile BJ, Stephenson JL Jr. An Alternative to Tandem Mass Spectrometry: Isoelectric Point and Accurate Mass for the Identification of Peptides. Anal Chem. 2004;76:267-75.
[2] Perez-Riverol Y, Sanchez A, Ramos Y, Schmidt A, Muller M, Betancourt L, et al. In silico analysis of accurate proteomics, complemented by selective isolation of peptides. J Proteomics. 2011;74:2071-82.
[3] Bjellqvist B, Hughes GJ, Pasquali C, Paquet N, Ravier F, Sanchez JC, et al. The focusing positions of polypeptides in immobilized pH gradients can be predicted from their amino acid sequences. Electrophoresis. 1993;14:1023-31.
[4] Cargile BJ, Sevinsky JR, Essader AS, Eu JP, Stephenson JL Jr. Calculation of the isoelectric point of tryptic peptides in the pH 3.5–4.5 range based on adjacent amino acid effects. Electrophoresis. 2008;29:2768-78.
[5] Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 2008;36:D202-5.
[Figure data: ten probability bins from 0.1 to 1.0; values listed in source order.]
Identified peptides: 211687, 33492, 15960, 11244, 9780, 9540, 10200, 11556, 16212, 4344
Non-redundant peptides: 16893, 2791, 1330, 937, 815, 795, 850, 963, 1351, 362
% peptides: 0.2, 2.6, 5.9, 6.1, 9.3, 14.0, 16.4, 16.8, 22.6, 31.2
Non-redundant: 10, 34, 39, 33, 45, 68, 94, 113, 228, 86

Using pICalculator: Bjellqvist pI and Cargile pI
Physicochemical and biological properties from AAindex (PD = (∑AD)/NA)
Using ChemAxon (http://www.chemaxon.com): refractivity index, polarizability, surface area, LogP

Conclusion
We combined an SVM approach with only two simple peptide descriptors to predict the isoelectric point of identified peptides, and our results show better accuracy than the existing methods. Furthermore, the ability to calculate the pI of peptides at this level of accuracy is desirable for peptide pI filtering. We envisage that the same approach could also be applied to predict the effect of post-translational modifications. The use of SVMs and the approach described in this work could be useful for these types of analyses.
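The two kinds of predictor named on this poster, a charge-state pI and a per-peptide average of an amino acid descriptor (PD = (∑AD)/NA), can be illustrated with a short Python sketch. Several things here are assumptions rather than the authors' method: the pKa values are rounded, textbook-style placeholders and not the Bjellqvist parameters, the reading of PD as the sum of per-residue descriptor values AD divided by the number of amino acids NA is an interpretation of the formula, and all function names are hypothetical.

```python
from typing import Dict

# ASSUMPTION: illustrative, rounded pKa values; NOT the Bjellqvist set.
PK_POSITIVE = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PK_NEGATIVE = {"Cterm": 2.3, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(sequence: str, ph: float) -> float:
    """Net peptide charge at a given pH (Henderson-Hasselbalch)."""
    charge = 0.0
    for group in ["Nterm", "Cterm"] + list(sequence):
        if group in PK_POSITIVE:
            # Fraction of the basic group that is protonated (+1).
            charge += 1.0 / (1.0 + 10.0 ** (ph - PK_POSITIVE[group]))
        elif group in PK_NEGATIVE:
            # Fraction of the acidic group that is deprotonated (-1).
            charge -= 1.0 / (1.0 + 10.0 ** (PK_NEGATIVE[group] - ph))
    return charge


def charge_state_pi(sequence: str, tol: float = 1e-4) -> float:
    """Find the pH of zero net charge by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        # Net charge decreases monotonically as pH rises.
        if net_charge(sequence, mid) > 0.0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def mean_descriptor(sequence: str, index: Dict[str, float]) -> float:
    """PD = (sum of AD) / NA for one amino acid index (assumed reading)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(index[residue] for residue in sequence) / len(sequence)
```

In such a setup, `charge_state_pi(peptide)` and `mean_descriptor(peptide, index)` would form the two-column feature vector passed to the regression model, with `index` holding the 20 per-residue values of the chosen AAindex entry.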