Tropsha 4 5 05


Published on November 24, 2007

Author: Dabby


Quantitative Genotype Phenotype Relationships (QGPR): Can we learn from Quantitative Structure Activity Relationships (QSAR) modeling?
Alexander Tropsha, Sasha Golbraikh, Scott Oloff, Raed Khashan
Laboratory for Molecular Modeling, School of Pharmacy
The unbearable lightness of “predictive” modeling

The relationship between target property and attributes (descriptors)
Objects   Target Property   Attributes (Descriptors)
Comp.1    Value1            D1  D2  D3  D4
Comp.2    Value2            "   "   "   "
Comp.3    Value3            "   "   "   "
Comp.N    ValueN            "   "   "   "
{TP} = K̂{Attributes}

Predictive biological data modeling: focus on validation
- QSPR is an empirical data modeling exercise:
  - Choice of statistical data modeling techniques
  - Choice of descriptor types
  - VALIDATE both internally and externally
- Non-linear methods with variable selection using stochastic optimization techniques to determine context-dependent descriptors
- Integrated workflow for predictive QSPR modeling
- Some simple validation techniques and (an example of) the applicability domain definition
- Examples of studies in QSPR and QGPR areas
- IT issues

Components of QSPR Modeling
- Target properties
  - Continuous (e.g., weight)
  - Categorical unrelated (e.g., different phenotypes)
  - Categorical related (e.g., subranges described as classes)
- Descriptors (or independent variables)
  - Continuous (allows distance based similarity)
  - Categorical related (allows distance based similarity)
  - Categorical unrelated (genotypes; special similarity metrics)
- Correlation methods (with and w/o variable selection)
  - Linear (e.g., LR, MLR, PCR, PLS)
  - Non-linear (e.g., kNN, RP, ANN, SVM)
- Validation and prediction
  - Internal (training set) vs. external (test set)

Slide5: VARIABLE SELECTION kNN QSAR*
- Randomly select a subset of descriptors (HDP)
- LEAVE-ONE-OUT CROSS-VALIDATION
  - Exclude a compound
  - Predict activity ŷ of the excluded compound as the weighted average of activities of 1 to K nearest neighbors
  - Calculate the predictive ability (q2) of the “model”
- Modify descriptor subset (SIMULATED ANNEALING)
- Select the best QSAR model for nvar and K
*Zheng, W. and Tropsha, A. JCICS, 2000, 40, 185-194.

Slide6: BEWARE OF q2!!! (Golbraikh & Tropsha, J. Mol. Graphics Mod. 2002, 20, 269-276.)
31 Cramer steroids [1] (benchmark to investigate novel QSAR methods [2])
[Figure: predictive R2 versus cross-validated R2 (q2) for QSAR models with q2 > 0.5, using a common definition (e.g., [3]) of training and test sets. Panel 1: training set compounds 1-21, test set compounds 22-31. Panel 2: training set compounds 1-12 and 23-31, test set compounds 13-22.]
1. Cramer, R.D. III, Patterson, D.E., Bunce, J.D. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988, 110, 5959-5967.
2. Coats, E.A. The CoMFA steroids as a benchmark data set for development of 3D QSAR methods. In 3D QSAR in Drug Design, V.3. Kubinyi, H., Folkers, G., Martin, Y.C., Eds. Kluwer/ESCOM: Dordrecht, 1998, pp 199-213.
3. Kubinyi, H., Hamprecht, F.A., Mietzner, T. Three-Dimensional Quantitative Similarity-Activity Relationships (3D QSiAR) from SEAL Similarity Matrices. J. Med. Chem. 1998, 41, 2553-2564.

COMPONENTS OF PREDICTIVE QSAR MODELING WORKFLOW*
- Model Building: combination of various descriptor sets and variable selection data modeling methods (Combi-QSAR)
- Model Validation
  - Y-randomization
  - Training and test set selection
  - Applicability domain
  - Evaluation of external predictive power
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:… Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
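The leave-one-out step of the variable-selection kNN slide and the Y-randomization check named in the workflow slide can be illustrated with a short sketch. This is not the authors' program: the function names, the inverse-distance weighting, the fixed k, and the synthetic data are all assumptions made for illustration, and the descriptor-subset search by simulated annealing is left out.

```python
import numpy as np


def loo_q2_knn(X, y, k=2):
    """Leave-one-out q2 for a kNN predictor (weighted average of neighbors).

    Assumption: neighbors are weighted by inverse Euclidean distance; the
    weighting used in the cited kNN QSAR paper may differ.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Pairwise Euclidean distances between all compounds.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)  # the excluded compound cannot be its own neighbor
    preds = np.empty(n)
    for i in range(n):
        nn = np.argsort(d[i])[:k]           # k nearest remaining compounds
        w = 1.0 / (d[i, nn] + 1e-12)        # inverse-distance weights
        preds[i] = np.dot(w, y[nn]) / w.sum()
    ss_res = ((y - preds) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def y_randomized_q2(X, y, k=2, n_rounds=20, seed=0):
    """q2 values obtained after shuffling the activities (Y-randomization)."""
    rng = np.random.default_rng(seed)
    return np.array(
        [loo_q2_knn(X, rng.permutation(y), k) for _ in range(n_rounds)]
    )


# Synthetic stand-in data (hypothetical; not from the presentation).
rng = np.random.default_rng(42)
X = rng.uniform(size=(40, 2))
y = 2.0 * X[:, 0] + X[:, 1]

q2_real = loo_q2_knn(X, y, k=2)
q2_rand = y_randomized_q2(X, y, k=2)
print(f"q2 with real activities: {q2_real:.2f}")
print(f"best q2 over {len(q2_rand)} shuffles: {q2_rand.max():.2f}")
```

A model whose q2 on the real activities is not clearly higher than the q2 values from shuffled activities would be suspect; as the "BEWARE OF q2" slide stresses, a high q2 alone still says nothing about external predictive power.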
Activity randomization
[Figure: structures Struc.1, Struc.2, Struc.3, ..., Struc.n and properties Pro.1, Pro.2, Pro.3, ..., Pro.n]

RATIONAL SELECTION OF MULTIPLE TRAINING AND TEST SETS*
*Golbraikh et al., J. Comp. Aid. Mol. Design 2003, 17, 241-253.

Slide10: DEFINING THE APPLICABILITY DOMAIN
- Training set: 60 compounds
- Test set: 35 compounds
- MODEL: two nearest neighbors; the number of descriptors: 8; Q2(CV) = 0.57; R2 = 0.67
- DISTANCES: <D>train = 0.287; StDev(D)train = s = 0.149
- Closest nearest neighbors of test set compounds: Dtest ≤ <D>train + s × ZCutOff (ZCutOff = 0.5)
- N is the total number of distances (Ntrain = 60 × 2 = 120; Ntest = 70)
- Ni is the number of distances in each category (bin)

Slide11: Criteria for Predictive QSAR Model
[Formulas not recoverable; labels on the slide: correlation coefficient, coefficients of determination, regression, regression through the origin, CRITERIA]

QSPR modeling process revisited
GENET-, GENOM-, PROTEOM-, BIOINFORMAT-, MEDINFORMAT-, CHEMOGENOM-, CHEMOINFORMAT-, PROTEOCHEMOMETR- + -ICS
“-ics”: an old Latin suffix that means “way too much”
COMBINATORIAL QSPRomics, or C-Qics

COMBINATORIAL QSPRomics (C-Qics)
- SAR Dataset
- Compound representation: COMFA descriptors, MolconnZ descriptors, Chirality descriptors, Volsurf descriptors, Comma descriptors, MOE descriptors, Dragon descriptors
- QSAR modeling: KNN (MML), BINARY QSAR,…, SVM (MML), DECISION TREE
- Selection of best models
- Model validation using external test set and Y-Randomization

Predictive QSAR Workflow
- Only accept models that have q2 > 0.6, R2 > 0.6, etc.
- Flowchart boxes: Original Dataset; Split into Training and Test Sets; Multiple Training Sets; Multiple Test Sets; Combi-QSAR Modeling; Y-Randomization; Activity Prediction; Validated Predictive Models with High Internal & External Accuracy; Database Screening

Slide15: STructure-Activity Relationships for the Design of Molecules (STARDOM™): WORKFLOW
Legend: programs, functions, user input
User input: Input Structure File; Input Descriptor File; Input Activity File; Database to Screen; QSAR Model(s)
Stage                       Function                     Programs
Descriptor Generation       Convert Structures           dbtranslate, Babel, etc.
                            Generate Descriptors         MolconnZ, GenAP, etc.
                            Normalize Descriptors        Utility (UNC)
Descriptor formatting       Reformat Descriptors         MolconnZ-ToDescr (UNC)
Build & Test Models         Train & Test Set Selection   SE8 (UNC)
                            Randomize Activities         Randomize (UNC)
                            QSAR Algorithm               RWKNN, SAPLS (UNC), etc.
                            Predict Test Set             KNNPredict, SAPLSPred (UNC), etc.
Report & Visualize Results  Compile Results              ModStat (UNC)
                            Visualize Results            Weblab, TSAR, MOE, Spotfire, etc.
Screen Database             Normalize Descriptors        Utility
                            Mine Database                DBMine, KNNPredict, etc.
                            Visualize Hits               TSAR, MOE, Spotfire, etc.

Predictive QSAR workflow as an automated grid application (currently based on IBM’s middleware)
- Browser
- Portal Server; WebSphere Application Server; WebSphere Work Flow; Java Wrappers
- Applications run on the Computer Grid: kNN, SVM’s, etc.; kNNPredict, SVMPredict, etc.
- Relational Database (DB2 or Oracle)
- File Database (Data Grid)
- Screening of Compound Databases
- Visualization Tools (Spotfire, ChemDraw, Chime, etc.)
- KEY (↔): Initial Model Building Flow; Screen Database Flow; Data Retrieval and Visualization

Slide17: EXAMPLE 1: COMBINATORIAL QSAR OF AMBERGRIS FRAGRANCE COMPOUNDS*
Odor descriptions shown with the example compounds:
- Amber, woody, cedarwood, animal, strong
- Amber woody, camphoraceous, spicy, weak
- Amber, exotic woody, animal
- Strong amber
- Amber woody, sea water
- Amber, camphoraceous
*Kovatcheva A., et al. J. Chem. Inf. Comp. Sci., 2004, 44, 582-95

Slide18: TOTAL PREDICTION ACCURACY FOR THE TEST SET USING BEST ACTUAL & RANDOMIZED MODELS
[Figure]

Example 3. Consensus QSAR models for the prediction of Ames genotoxicity*
- 3,363 diverse compounds (including >300 drugs) tested for their Ames genotoxicity
- 60% mutagens, 40% non-mutagens
- 148 initial topological descriptors
- ANN, kNN, Decision Forest (DF) methods
- 2963 compounds in the training set, 400 compounds (39 drugs) in randomly selected test set
*Votano JR, Parham M, Hall LH, Kier LB, Oloff S, Tropsha A, Xie Q, Tong W. Mutagenesis, 2004, 19, 365-77.

Comparison of GenTox prediction for 30 drugs in external test set
[Figure]

Content-dependent descriptor types identified by different models (LogP was never selected)
[Figure]

Effect of applicability domain on the prediction accuracy of kNN QSAR
[Figure]

Genomic Butterfly Spot Dataset
- 2000 data examples with presence or lack of phenotype
- 6 developmental loci result in the phenotype
- 30 additional loci added as noise
- HYPOTHESIS: Our well developed QSPR-omics methodologies can be accurately applied to QGPR to identify the developmental loci

kNN Results (Traditional)
- 70-90% training set accuracy; however, phenotypes were predicted differently with identical selected descriptor values

New “k”NN for QGPR
If more than “k” elements have identical selected descriptors, then we average all of those elements rather than the first “k”.
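The tie rule of the new “k”NN can be shown with a small sketch. The function names, the use of Hamming distance for categorical genotype descriptors, and the toy data are assumptions for illustration; extending the rule to every element tied at the k-th neighbor distance (not only identical ones) is also an assumption.

```python
import numpy as np


def first_k_predict(X_train, y_train, x, k=3):
    """Traditional kNN: average the first k elements in sorted order."""
    d = (np.asarray(X_train) != np.asarray(x)).sum(axis=1)  # Hamming distance
    nearest = np.argsort(d, kind="stable")[:k]
    return np.asarray(y_train, dtype=float)[nearest].mean()


def tie_aware_predict(X_train, y_train, x, k=3):
    """Tie-aware kNN: average every element at least as close as the k-th one."""
    d = (np.asarray(X_train) != np.asarray(x)).sum(axis=1)  # Hamming distance
    kth_distance = np.sort(d)[min(k, len(d)) - 1]
    tied = d <= kth_distance  # includes all elements tied with the k-th neighbor
    return np.asarray(y_train, dtype=float)[tied].mean()


# Hypothetical toy data: five elements share the genotype (0, 1) but differ
# in phenotype; one further element has a different genotype.
X_train = [[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [1, 1]]
y_train = [1, 1, 1, 0, 0, 0]
query = [0, 1]

pred_first_k = first_k_predict(X_train, y_train, query, k=3)
pred_tie_aware = tie_aware_predict(X_train, y_train, query, k=3)
print(pred_first_k, pred_tie_aware)
```

With the first-k rule the prediction depends on the storage order of identical elements, which is one way identical descriptor values can receive different predictions; averaging over all tied elements removes that dependence.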
ONLY the descriptors c_source and c_thresh were found to be relevant

Slide26: SVM Classification
[Figure]

Slide27: SVM Classification
[Figure]

Descriptors Found Identified by SVM
[Figure]

Recursive Partitioning using DTReg
[Figure]

Random Forests using DTReg
[Figure; label on slide: Shuffled]

Difficult Data Structures to model
- “k”NN works well
- SVM-RBF works well
- Decision Trees: no correlation
- Random Forest: no correlation

CLASSIFICATION ACCURACY CRITERIA AS TARGET FUNCTIONS IN QSAR
Alexander Golbraikh
April 5, 2005

Slide33: 2x2 CONFUSION MATRIX AND MEASURES OF CLASSIFICATION ACCURACY
               ACTUAL(+)   ACTUAL(-)   TOTAL
PREDICTED(+)   A           B           A+B
PREDICTED(-)   C           D           C+D
TOTAL          A+C         B+D         N=A+B+C+D

Measure                            Formula              Marked “+”
Prevalence                         (A+C)/N
Overall diagnostic power           (B+D)/N
Correct classification rate        (A+D)/N
Sensitivity (Ss)                   A/(A+C)              +
Specificity (Sp)                   D/(B+D)              +
False positive rate                B/(B+D)              +
False negative rate                C/(A+C)              +
Positive predictive power (PPP)    A/(A+B)
Negative predictive power (NPP)    D/(C+D)
Misclassification rate             (B+C)/N
Odds ratio                         (AD)/(BC)            +
Enrichment                         AN/[(A+B)(A+C)]
Kappa                              {(A+D)/N - [(A+C)(A+B)+(B+D)(C+D)]/N²} / {1 - [(A+C)(A+B)+(B+D)(C+D)]/N²}
Fielding, A.H.; Bell, J.F. Environmental Conservation 1997, 24 (1), 38-49.

Slide34: DRAWBACK OF SOME CHARACTERISTICS
               Actual (+)   Actual (-)   Total
Predicted (+)  60           6            66
Predicted (-)  20           14           34
Total          80           20           100
PPP = 60/66 = 0.91; Prev = 80/100 = 0.80; E = 0.91/0.80 = 1.14

               Actual (+)   Actual (-)   Total
Predicted (+)  6            6            12
Predicted (-)  2            14           16
Total          8            20           28
PPP = 6/12 = 0.50; Prev = 8/28 = 0.29; E = 0.50/0.29 = 1.72

Slide35: NORMALIZED CONFUSION MATRICES
               Actual (+)   Actual (-)   Total
Predicted (+)  42/70        60/340       42/70+60/340
Predicted (-)  28/70        280/340      28/70+280/340
Total          70/70        340/340      70/70+340/340

               Actual (+)   Actual (-)   Total
Predicted (+)  0.60         0.18         0.78
Predicted (-)  0.40         0.82         1.22
Total          1            1            2
PPP = 0.60/0.78 = 0.77; Prev = 1/2 = 0.50; E = 0.77/0.50 = 1.54

Slide36: THE NORMALIZED CONFUSION MATRIX AND CLASSIFICATION ACCURACY MEASURES
               Actual(+)    Actual(-)    Total
Predicted(+)   A/(A+C)      B/(B+D)      A/(A+C)+B/(B+D)
Predicted(-)   C/(A+C)      D/(B+D)      C/(A+C)+D/(B+D)
Total          1            1            2
Marked “+”: Sensitivity (Ss) A/(A+C); Specificity (Sp) D/(B+D); False positive rate B/(B+D); False negative rate C/(A+C); Positive predictive power (PPP); Negative predictive power (NPP); Correct classification rate (CCR); Misclassification rate; Odds ratio (AD)/(BC); Enrichment; Kappa
Not marked: Prevalence (A+C)/N; Overall diagnostic power (B+D)/N

CLASSIFICATION QSAR: nxn NORMALIZED CONFUSION MATRIX
[Figure]

CLASSIFICATION ACCURACY: CONSIDERATIONS
- Many parameters used for evaluation of classification accuracy cannot be used as characteristics of QSAR models, because they depend on the size of each class.
- These parameters become independent of the size of each class if they are calculated using normalized confusion matrices.
- n² - n linearly independent parameters are necessary to fully characterize the performance of classification accuracy algorithms.
- When we are not interested in the classes to which misclassified compounds are assigned, the n diagonal elements of the normalized confusion matrix are sufficient to estimate the algorithm performance.
- A set of criteria that good classification models must satisfy was established.

Decision Tree (MOE): data
Dataset 1 and 2: 2000 objects, 36 descriptors
- Training set: 1200 objects (used for learning a tree); Class 1: 600 objects, Class 2: 600 objects
- Internal test set: 400 objects (used for pruning a tree); Class 1: 200 objects, Class 2: 200 objects
- External test set: 400 objects (used for prediction); Class 1: 200 objects, Class 2: 200 objects

Decision Tree (MOE): parameters
- Protocol: separate test sample
- Descriptors included: 36 or 34
- Node Split Size: 10
- Max. Sample Size: 255
- Max. Tree Depth: 10
- Best Tree Thresh: 1.0, 0.8*, 0.6*, 0.4*
- Use Priors
* With 34 descriptors only

Slide41: Decision Tree (MOE): results
- All 36 descriptors
- The trees included only two descriptors: c_source and c_thresh
- Prediction accuracy for BOTH DATASETS:* Training + Internal Test sets: 100%; External Test set: 100%
* Result has been checked using EXCEL: pairs of c_source and c_thresh values uniquely define the object class for the whole datasets!

Slide42: Decision Tree (MOE): results
- 34 descriptors: c_source and c_thresh were excluded
- Prediction accuracy for BOTH DATASETS: BAD

Decision Tree (MOE): conclusions
- A model based on only two descriptors, c_source and c_thresh, predicts the classes with an accuracy of 100%.
- There are no other important descriptors in the dataset.

Summary
- The predictive QSPR workflow affords statistically significant models which can be used directly for database mining.
- Extensive model validation is a must!
- Consensus screening is more effective than using single models.
- Model building should be an ongoing process concurrent with experimental validation and model enrichment -> integrated workflows.

“The public has an insatiable curiosity to know everything, except what is worth knowing.” Oscar Wilde

ACKNOWLEDGMENTS
UNC ASSOCIATES (current and former)
- QSAR group: Alex GOLBRAIKH, Raed KHASHAN, Scott OLOFF, Kun WANG, Mei WANG, Chris GRULKE, Jun FENG, Yun-De XIAO, Yuanyuan QIAO, Patricia LIMA, Assia KOVACHEVA, M. KARTHIKEYAN
- Protein structure group: John GRIER, Luke HUAN, Ruchir SHAH, Shuxing ZHANG, Shuquan ZONG, Peter ITSKOWITZ
- Former: Stephen CAMMER, Sung Jin CHO, Weifan ZHENG, Min SHEN, Bala KRISHNAMOORTHY
Funding: NIH, NSF, NCI-BSF, Berlex, IBM, MCNC, GSK, Inspire, Millennium, Ortho-McNeil
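As a worked check of the enrichment numbers on the “DRAWBACK OF SOME CHARACTERISTICS” and “NORMALIZED CONFUSION MATRICES” slides, the following sketch recomputes E = PPP / prevalence from raw and from column-normalized 2x2 matrices. The function names are hypothetical; the matrices are the ones printed on the slides, laid out as [[A, B], [C, D]].

```python
import numpy as np


def enrichment(matrix):
    """E = PPP / prevalence for [[A, B], [C, D]].

    Rows are predicted (+)/(-); columns are actual (+)/(-).
    """
    m = np.asarray(matrix, dtype=float)
    (a, b), (c, d) = m
    ppp = a / (a + b)                # positive predictive power
    prevalence = (a + c) / m.sum()   # fraction of actual positives
    return ppp / prevalence


def normalize(matrix):
    """Divide each column by its class size so every class sums to 1."""
    m = np.asarray(matrix, dtype=float)
    return m / m.sum(axis=0)


raw_large = [[60, 6], [20, 14]]       # first table, N = 100
raw_small = [[6, 6], [2, 14]]         # second table, N = 28
raw_slide35 = [[42, 60], [28, 280]]   # counts behind the normalized example

e_large = enrichment(raw_large)
e_small = enrichment(raw_small)
e_large_norm = enrichment(normalize(raw_large))
e_small_norm = enrichment(normalize(raw_small))
e_slide35_norm = enrichment(normalize(raw_slide35))

print(f"raw enrichment:        {e_large:.3f} vs {e_small:.3f}")
print(f"normalized enrichment: {e_large_norm:.3f} vs {e_small_norm:.3f}")
print(f"normalized enrichment, 70/340 example: {e_slide35_norm:.3f}")
```

The two raw tables have the same sensitivity (0.75) and the same false positive rate (0.30), yet their raw enrichments differ because the class sizes differ; after normalization both give the same value. Computed without intermediate rounding, the second raw table gives 1.75 and the 70/340 example gives about 1.545, slightly different from the slide values, which use rounded intermediates.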
