T0386: some easy. The two ends (30 residues each) are hard.
  final decision on domain: predict as two domains:
    a) domain parser predicts three fragments
    b) fragment 1 and fragment 2 are very close, so they should be in the same domain
    d) domain fusion mechanism?
    e) alignment signal?
    f) secondary stx pattern: at the last end, there is a lot of beta-sheet.
  use human 5. domain parser cuts it into three domains. if we check the two fragments at both ends,
  they are very close. in this case, the two fragments are joined into one domain (domain fusion mechanism).
  Also, we can check the alignment boundary.
  so we can cut it into two domains:
    1-33 and 263-end: domain 1.
    34-262: domain 2
  predict: 1-33 and 263-end form an a-b-a domain.
  according to CDD blast, it seems to have two domains too. one is a filament domain, the other one is unknown
  (the filament domain has an HPFXGNG motif which is aligned well in both the psi-blast and lobster alignments).
  it looks like the lobster alignment is better in this case?
  CDD provides evidence that this protein is probably two-domain.
  foldpro:
    use human3 as model 1
    use human 5 as model 3
    final: use human 0 as model 1.
  3dpro:
    keep model 3
    use human 3 as model 1
    use human 2, 4, 5 as model 2, 4, 5.
  2F6S: filamentation protein
  2g03: filamentation protein
  cm_ab seems to be the best, but it has a knot.
  now use 2f6s and 2g03 to combine with ab to generate models respectively.

T0385: hard
  decision: use human models 1-5 as models 1-5 for foldpro and 3dpro respectively.
  select five models for foldpro and 3dpro each from the top 150 models
  compare foldpro:
    1-2: 6.6  1-3: 5.7  1-4: 5.7  1-5: 5.5
    2-3: 6.1  2-4: 5.6  2-5: 5.5
    3-4: 5.3  3-5: 5.3
    4-5: 5.5
  compare 3dpro: (all belong to the ferritin family according to scop)
    1-2: 5.7  1-3: 6.3  1-4: 5.6  1-5: 6.5
    2-3: 5.7  2-4: 6.3  2-5: 5.7
    3-4: 5.5  3-5: 6.6
    4-5: 5.7
  compare 3dpro model 1 and foldpro model 1:
    Rmsd = 1.2Å Z-Score = 6.7 Sequence identity = 98.0% Aligned/gap positions = 148/2
  generate stx for top 150 templates
  foldpro: models 5, 7 are better
  3dpro: models 6, 7 (ab initio) are better.
--------------------------------------------------------------------------------
observation: foldpro alignment often fails for multi-domain proteins even though the correct template is identified.
the reason is: one correct domain, one wrong domain.
--------------------------------------------------------------------------------
T0384: easy
  foldpro: keep model 1, use easy1-4 as model 2-5
  3dpro: keep model 1, use easy2-5 as model 2-5

T0383: hard
  foldpro:
    only model 6 is good.
    use human 1-4 as model 1-4
    model 6 as model 5
    human models are generated from the top 150 templates and evaluated by model_check.
  3dpro:
    only ab-initio is slightly better.
    use human 1-4 as model 1-4
    ab model as model 5.
--------------------------------------------------------------------------------
***********************************************************
Weakness of model_check: an extended helix that matches the secondary stx well can have a high score.
the stx is not compact; usually the sov acc matching score is very low.
We need to enlarge the dataset to train?
***********************************************************
In this case, the sa score is usually about 50%. the ss score can be high (if one ss dominates) or low.
We need to add this kind of example to penalize it, or we need to add a compactness term.
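One possible form of the compactness term mentioned above is the radius of gyration of the C-alpha trace. The following is only a minimal sketch, assuming the model is available as a list of (x, y, z) C-alpha coordinates. The function names, the reference radius `rg_ref`, and the `tolerance` and `factor` values are hypothetical and are not part of model_check.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    sq = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
             for p in coords)
    return math.sqrt(sq / n)

def penalize_if_extended(score, rg, rg_ref, tolerance=1.5, factor=0.5):
    """Scale the score down when rg is much larger than a reference value.

    rg_ref is whatever radius is considered normal for a compact protein
    of this length; tolerance and factor are illustrative assumptions.
    """
    if rg > tolerance * rg_ref:
        return score * factor
    return score
```

An extended helix has a much larger radius of gyration than a compact fold of the same length, so a term like this could lower the score of the "extended string" models even when their ss match is high.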
good selection criteria:
  predicted gdt-ts > 0.30
  sa matching score > 0.6
  ss matching score > 0.5

FOLD recognition using model_check evaluation:
  create a smaller library (25% identity)
  generate an alignment and a model for the query using each template
  use model_check to evaluate the model and generate an evaluation score
  select models with gdt > 0.3, sa > 0.6, ss > 0.5
  Then rank the final models using gdt scores and return the top-ranked models
  penalize models with high gdt but low sa match score (<0.55). these models are usually just an extended string.
  usually we can halve the score.
--------------------------------------------------------------------------------
T0382: hard target.
  sspro: a helical protein.
  the predictions of 3dpro and foldpro are not good. the ab-initio model's score is very high.
  now submit T0382 to 3Djury.
  3dpro:
    use model 8 (ab initio) as model 1
    use human 1-4 as model 2-5
    manually select 4 models from the top 130 models generated by modeller as model 2-5.
  foldpro:
    use the 4 models manually selected on 3dpro.
    now generating another 150 stx on foldpro (mine4)

T0381: easy
  3dpro:
    use model 2 as model 1
    use model 1 as model 2
    use easy 1,3,4 as model 3-5
  FR system: difficulty with a 2-domain protein. only the N domain 3-helical bundle is identified.
  the second domain is not right. this is the problem of global alignment.
  domain 1: 1-87
  domain 2: 88-end.
  foldpro:
    keep model 1,2
    use easy 1,3,4 as model 3-5
    more change: use easy 1 as model 1
    use current model 1 as model 3.
--------------------------------------------------------------------------------
************next CASP, must train a system specific for FR/A.***************
****************************************************************************
another idea:
generating one model using modeller seems to be very fast. for a hard target, we can actually generate thousands of models,
then use an energy function to evaluate them. it may only take one day. In this case, it is very likely we can identify good models?
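The selection criteria and the halving penalty noted above can be sketched as follows. This is only an illustration under stated assumptions: the `ModelScore` record and the function names are hypothetical, and the real model_check output format is not shown in these notes. The thresholds (0.30, 0.6, 0.5, 0.55) and the halving are taken from the notes. Because the filter already requires sa > 0.6, the penalty only has an effect when the filter is switched off.

```python
from dataclasses import dataclass

@dataclass
class ModelScore:
    name: str
    gdt: float  # predicted gdt-ts, on a 0-1 scale
    sa: float   # solvent accessibility matching score
    ss: float   # secondary structure matching score

def passes_filter(m, gdt_min=0.30, sa_min=0.6, ss_min=0.5):
    """Selection criteria from the notes."""
    return m.gdt > gdt_min and m.sa > sa_min and m.ss > ss_min

def ranking_score(m, sa_penalty_cutoff=0.55):
    """Predicted gdt, halved for models with a low sa match."""
    return m.gdt * 0.5 if m.sa < sa_penalty_cutoff else m.gdt

def rank_models(models, top_n=5, apply_filter=True):
    """Return the top_n models by (possibly penalized) gdt score."""
    pool = [m for m in models if passes_filter(m)] if apply_filter else list(models)
    return sorted(pool, key=ranking_score, reverse=True)[:top_n]
```

With `apply_filter=False`, a model with high predicted gdt but sa below 0.55 stays in the pool but drops in the ranking, which matches the intent of penalizing extended-string models instead of discarding them outright.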
Actually, this is a new FR idea, especially for hard targets (FR/A or some kind of NF)
idea:
  build a non-redundant library (25% sequence identity) (at most 3000 proteins)
  align the query against the sequence of each template, and generate a 3D model using modeller
  use model_check to evaluate, record the score
  pick a few templates with maximum scores.
*******************************************************************************
We can add this layer right after psi-blast, or add it after foldpro if time permits.
paper: Protein Fold Recognition Using Model Quality Evaluation
--------------------------------------------------------------------------------
7/17/2006
The script for generating a stx from a specified template is done: gen_stx_from_pdb_code.pl.
the option file is in mine 3: /var/preserve/prosys/web/htdocs/T0372/gen_stx_option
inputs: option file, query_fasta_file, template id (pdb_code + chain id), output dir
now we can use 3d-jury to help us find templates for hard targets.
test on T0372: use 1ne9a to generate a stx in /var/preserve/prosys/web/htdocs/T0372/human
use model_check to evaluate: score 0.3842301.
The score is much better than our previous score, but still significantly lower than SP2 (62).
that means sp3 has a better alignment than Lobster.
CE alignment between our human model and the sp3 model:
  Rmsd = 3.1Å Z-Score = 6.7 Sequence identity = 29.5% Aligned/gap positions = 258/67
--------------------------------------------------------------------------------
T0380: easy
  foldpro:
    replace model 5 with easy model 3.
    replace model 2 with model 7.
  3dpro:
    use model 7 and model 8 as model 3, 4.
    use easy model 5 as model 5.
--------------------------------------------------------------------------------
Post analysis
T0372 is a hard target. why did most groups find 1ne9, but we didn't???!!!
are most people using 3D-Jury to identify templates???
use model_check: the score of sp3 based on 1ne9 is 0.62. looks like it is a correct template.
next CASP, we need to use 3d-jury when no templates are identified?
try to use 3d-jury: 1ne9 is ranked as no. 2.
we need a script to generate a pir alignment given a pdb id, using the sequence in our own database. then we can generate a stx.
*****************NEW IDEAS********************
1ne9 is ranked about 100.
Another way: for a hard target, select the top 200 templates, generate 200 models, then run model_check and select a few with the highest scores.
Add another layer to the hard target (FR/A): another trick is to rank templates using a few alignment scores individually
(hhsearch, prc-hmm, compass, psi-blast, hmmer, lobster and so on?) to make an internal meta prediction.
for T0372, our domain prediction is also wrong. actually, there is a domain boundary signal in the alignments.
adjustment: for the last six targets, submit the targets to 3D-Jury too, so if we see some difficulty, we can use 3D-Jury.
new strategy for hard targets:
  use foldpro ranking (top 200)
  use 3djury ranking (we need a script to submit tasks to 3d-jury and parse the output)
  use alignment ranking in foldpro
  then generate a model from each of the different templates, then use model_check to evaluate and choose models.
**************BIG FAILURE***********************************
NEXT TIME, WE NEED TO USE 3D-JURY
BAKER USES 3D-JURY, MANY GROUPS USE 3D-JURY.
WE ALSO NEED TO BUILD OUR OWN 3D-JURY SYSTEM USING EXISTING ALIGNMENT METHODS.
************************************************************
Another lesson: we need to use the domain boundary signal explicitly encoded in the sequence alignment.
important lesson: (******************VERY VERY VERY VERY IMPORTANT LESSON***********************)
we need to avoid failure of FR.
trick: use more rankings, select more templates, evaluate by model_check, and use external FR (such as 3D-Jury)
Key steps in next casp: meta-server
1. external meta server
   once receiving a casp target, submit it to all other servers.
   results are sent to a special mail box. it can be checked by a human when the target is hard, or the server automatically processes it.
   the model is then evaluated by model_check.
   the external servers include "3D-Jury".
2. internal meta server
   use alignment scores to rank templates. select the top templates from each alignment method,
   then generate models from the top templates, then use model_check to evaluate.
Why do we need a meta server?
1. the meta server can compete by itself, and we have model_check to evaluate models.
2. take models from the meta server when foldpro encounters difficulty and may miss some templates.
   by doing this, we can avoid a fatal failure of foldpro (like T0372: everybody got it right, but we didn't.)
--------------------------------------------------------------------------------
CASP 7 experience
Problems identified with foldpro:
  good: templates can be identified.
  bad: profile-profile alignment quality is not good, especially when involving multi-domain proteins
  combination is not good:
    a) alignment quality is not good
    b) don't remove inconsistent templates.
--------------------------------------------------------------------------------
How to evaluate if a relatively small region is a domain or just an extension of another domain?
1. length (> 50?)
2. fold independently (does it rely on another domain?)
3. structure or function unit?
4. alignment signal?
5. any mechanisms: domain/gene fusion if it is an insertion in the middle? at the two ends, is it just a dangling region, protrusion region, or linker and so on?
6. if a region consists of a few insertions, it is probably not a domain, because a simple gene fusion can't produce this kind of domain structure.
--------------------------------------------------------------------------------
T0379: easy
  foldpro: keep cm 1, use easy 1-4 as model 2-5
  domain: visual inspection: should be two domains.
  but domain parser classified it into one domain. (we need to try other tools (Xu's tool, Sander's tool), or consult the CATH and SCOP databases.)
  check templates:
    1CQZ: single domain: had-like
    1VJ5: single domain: had-like
    1S8O: single domain: had-like
    2B0C: ?
  According to T0379.align, there is a domain cutting signal at about position 90. so let it be two domains.
  domain 1: 1-15, 87-end.
  domain 2: 16-84
  Our algorithms:
    look at the alignment for a domain boundary signal.
    check scop / cath domain definitions; if consistent, use them.
    check the scop domain definition, domain parser, Xu's domain parser, Sander's domain parser, and take the majority vote.
    need a script to check, to avoid cutting in the middle of a beta-sheet.
  3dpro: keep model 1, use easy2-5 as model 2-5.
  Further domain analysis using templates identified by foldpro:
    1L7M: scop: HAD-like domain, cath: rossmann fold: single domain
    1Q92: same as above
    1NNL: scop: had-like, cath: rossmann fold
    1k1e: same
    exceptions:
    1RQL: scop: had-like. cath: domain 1: rossmann fold, domain 2: orthogonal bundle
    1ZRN: scop: had-like, cath: two domains.
  IMPORTANT NOTICE: even scop says the fold contains an insertion of a (sub)domain.
  for d1ZRN__
    # Root: scop
    # Class: Alpha and beta proteins (a/b) [51349] Mainly parallel beta sheets (beta-alpha-beta units)
    # Fold: HAD-like [56783] 3 layers: a/b/a; parallel beta-sheet of 6 strands, order 321456
    # Superfamily: HAD-like [56784] contains an insert alpha+beta subdomain; similar overall fold to the Cof family
      usually contains an insertion (sub)domain after strand 1
    # Family: L-2-Haloacid dehalogenase, HAD [56785] the insertion subdomain is a 4-helical bundle
    # Protein: L-2-Haloacid dehalogenase, HAD [56786]
  That means proteins in the scop domain database can actually contain a small domain. scop usually doesn't cut closely coupled domains.
  The key issue is: do we treat it as a domain or a subdomain?

T0378: easy
  foldpro: keep model 1-3, use easy 1, 3 as model 4, 5.
  3dpro: keep model 1-3, use easy 1, 3 as model 4, 5
  comparisons:
    pair                              Rmsd   Z-Score  Seq. identity  Aligned/gap
    foldpro cm1 vs 3dpro cm1          0.9Å   7.6      92.7%          246/10
    3dpro cm1 vs foldpro easy 1       1.4Å   7.4      90.7%          246/12
    foldpro cm 1 vs 3dpro easy 1      1.0Å   7.5      95.2%          249/4
    foldpro easy 1 vs 3dpro easy 1    1.0Å   7.4      91.5%          247/12

T0376: easy
  foldpro:
  templates:
    1HL2A, B: fold: tim beta/alpha barrel, super: aldolase, fam: class I aldolase. scop/cath: 1 domain.
    1FDYA, B: same as above
    1NAL1: same as above
    1DHPA: same as above
    1S5VA: same as above
    1S5WA: same as above
    2A6L: ?
    1S5T: same as above
    2A6N: ?
    1XKY: ?
    1O5K: same as above
    1F6K: same as above
    1W3I: same as above
  So both cath and scop classify these templates into one domain.
  for this target, there are a few helices dangling at the C terminus, but they may not be an independent domain in their own right.
  but the last region really looks like an orthogonal helix bundle. these extended helices may be functionally important regions?
  so final decision: single domain (unchanged so far).
  keep model 1 and 2, use easy 1,3,5 as model 3,4,5.
  3dpro: use easy model 1-3 as models 3, 4, 5.

T0375: easy
  foldpro: keep model 1 and 2, use easy 1-3 as model 3-5.
  domain is ambiguous:
  1LIO
    scop: one domain
    cath: two domains: 3-layer (aba), 2-layer sandwich
  For this protein, I somewhat prefer one domain. the extended small beta-sheet is too complex. it may not be a domain by itself.
  scop description:
    # Root: scop
    # Class: Alpha and beta proteins (a/b) [51349] Mainly parallel beta sheets (beta-alpha-beta units)
    # Fold: Ribokinase-like [53612] core: 3 layers: a/b/a; mixed beta-sheet of 8 strands, order 21345678, strand 7 is antiparallel to the rest
      potential superfamily: members of this fold have similar functions but different ATP-binding sites
    # Superfamily: Ribokinase-like [53613] has extra strand located between strands 2 and 3
    # Family: Ribokinase-like [53614]
    # Protein: Adenosine kinase [53617]
    # Species: Toxoplasma gondii [53619]
  so scop considers the protruding strands as extra strands, not a (sub)domain.
  1LIJ: same as 1LIO
  2ABS
  1DGM
  1LII
  2A9Y
  domain decision: I would rather treat the beta-sheet as two long loops that collide with each other to form a sheet.
  they may just be a functionally important site.
  this is also a problem of domain parser, which cuts in the middle of one beta-sheet (at position 133).
  so, if we had a program to keep domain parser from cutting in the middle of a beta-sheet, this problem could be avoided.
  That is why we manually tweak the domain prediction.
  3dpro: use easy1-4 as model 1-4, use current cm model 1 as model 5.
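The manual tweak described above (moving a domain cut so that it does not fall inside a beta-sheet) could be partly automated. The sketch below is a simplification and an assumption, not the group's program: it only looks at a per-residue secondary structure string (with `E` marking strand residues) and moves the cut to the nearest position that does not split a strand. It does not know which strands pair into the same sheet, which would need the strand pairings from a known structure. The function name and the string encoding are hypothetical.

```python
def shift_cut_out_of_strand(ss, cut, strand_char="E"):
    """Move a domain cut so that it does not split a strand segment.

    ss  : secondary structure string, one character per residue
    cut : 1-based position; residues 1..cut go to the first domain,
          residues cut+1..end go to the second domain
    Returns the nearest cut position that does not separate two
    adjacent strand residues (ties resolve toward the N terminus).
    """
    n = len(ss)
    if not 1 <= cut < n:
        raise ValueError("cut must leave at least one residue on each side")

    def splits_strand(c):
        # residue c is ss[c-1]; residue c+1 is ss[c]
        return ss[c - 1] == strand_char and ss[c] == strand_char

    if not splits_strand(cut):
        return cut
    for offset in range(1, n):
        for cand in (cut - offset, cut + offset):
            if 1 <= cand < n and not splits_strand(cand):
                return cand
    return cut  # the whole chain is one strand; nothing better to offer
```

For a cut like the one at position 133, the input would be the secondary structure of the template or the sspro prediction for the target; the result is only as good as that strand assignment.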
  comparisons:
    pair                              Rmsd   Z-Score  Seq. identity  Aligned/gap
    foldpro cm vs foldpro easy1       1.4Å   7.6      95.2%          289/4
    foldpro cm vs foldpro easy2       1.8Å   7.6      93.8%          290/8
    foldpro easy1 vs easy 2           2.1Å   7.5      94.4%          288/8
    3dpro cm vs foldpro cm            2.1Å   7.5      93.7%          287/16
    3dpro cm vs 3dpro easy 1          2.3Å   7.5      95.8%          286/16
    3dpro cm vs 3dpro easy 2          2.2Å   7.5      95.8%          285/16
    foldpro cm vs 3dpro easy 1        1.4Å   7.7      100.0%         290/0
    foldpro cm vs 3dpro easy 2        1.0Å   7.8      100.0%         295/0
    3dpro easy 1 vs foldpro easy 1    1.7Å   7.5      96.2%          286/8

T0374: easy
  3dpro:
    keep model 1,2
    use models 6, 7 as models 3,4
    use easy1 as model 5.
  foldpro:
    keep model 1
    use 3dpro model 1 as model 2
    keep model 3
    use model 8 as model 4
    use easy model 1 as model 5

T0373: easy
  3dpro:
    keep model 1
    use easy1-4 as model 2-5
    (very good model 1; it extends one cm template longer than the same template used by foldpro)
    that means a longer psi-blast alignment is better. sometimes manual extension is necessary.
  FOLDpro:
    keep model 1
    use 3dpro model 1 as model 2.
    use easy1,2,5 as model 3,4,5.
  compare model 1 of 3dpro and foldpro:
    Rmsd = 2.2Å Z-Score = 6.2 Sequence identity = 88.1% Aligned/gap positions = 135/7

T0372: hard (WE FAILED THIS TARGET. SHOULD HAVE CHECKED 3D-JURY, WHERE A GOOD TEMPLATE IS FOUND: 1NE9.)
  3Dpro: use model 7 to replace model 2
  foldpro: replace model 4 with model 6.
    (too big a distance between two residues in the model)
    replace model 5 with model 1 of 3dpro (too big a distance between two residues in the model)

T0371: easy
  foldpro:
    domain: 1-80/203-end, 81-202 (right now the range is set at 1-50)
    New ideas: FURTHER IMPROVEMENT of DOMAIN CUTTING algorithms.
    The same beta-sheet must be put in the same domain.
    (need a program to adjust the domain cutting according to beta-sheet pairings for known structures)
    use easy model 1-4 as model 1-4
    use current model 1 as model 5.
  3dpro:
    use easy model 1-4 as model 1-4
    use current model 1 as model 5.

T0370: fr
  3dpro: unchanged.
  foldpro:
    ? use model 1 of 3dpro as model 1 of foldpro.
    ? model 1 of foldpro includes too many templates, some of which are not very good.
    **********final decision: model 1 is actually ok. so unchanged.************

T0369: hard
  foldpro: use model 7 to replace model 1
  3dpro: use model 4 as model 1, model 6 as model 4.

T0368: fr
  foldpro:
    change it to one domain. done.
    exchange model 1 and model 3. done.
  3dpro:
    resubmit model 3 due to error (1A17_).
    replace model 2 with model 6.
    resubmit model 5 due to error. done.
    use the human model to replace model 1. the human model extends the cm template.
    in the current model 1, the last two helices are too far away.

T0367: fr
  compare model 1 of 3dpro and foldpro:
    Rmsd = 1.1Å Z-Score = 6.7 Sequence identity = 100.0% Aligned/gap positions = 124/0
  3Dpro: identifies three positive templates
    1 1UFBA 1.09 (reso: 1.9, fold: four-helical up/down bundle, super: Nucleotidyltransferase binding, fam: HEPN domain)
    2 1WOLA 0.95 (reso: 1.62, HEPN protein)
    3 1O3UA 0.88 (reso: 1.75, fam: same as 1UF8: HEPN domain)
    So all three are consistent. frcom can be used as top. No changes are necessary.
  FOLDPRO:
    1 1O3UA 1.52
    2 1UFBA 1.43
    3 1WOLA 1.31
    Use model 8 as model 5.

T0366: easy
  the two cm models of foldpro and 3dpro are exactly the same.
  FOLDpro:
    keep cm as model 1, frcom as model 2
    use easy 1, 2, 5 as model 3, 4, 5.
    comparisons:
      pair                 Rmsd   Z-Score  Seq. identity  Aligned/gap
      cm vs easy 1         0.7Å   6.5      100.0%         104/0
      (unlabeled)          0.6Å   6.3      100.0%         101/0
      (unlabeled)          1.5Å   5.9      100.0%         92/0
      cm vs frcom          2.3Å   5.9      100.0%         93/0
      cm vs cm of 3dpro    0.0Å   6.6      100.0%         106/0
  3Dpro:
    keep cm model 1.
    use easy model 1,2,3,5 as model 2-5.
    comparisons:
      pair                 Rmsd   Z-Score  Seq. identity  Aligned/gap
      cm vs easy1          0.7Å   6.5      100.0%         104/0
      cm vs easy2          0.6Å   6.3      100.0%         101/0
      cm vs easy 3         0.4Å   6.3      100.0%         96/0
      cm vs easy 5         1.6Å   5.9      100.0%         93/0
      easy 1 vs easy 2     0.8Å   6.3      100.0%         102/0
      easy 1 vs easy 3     0.6Å   6.2      100.0%         97/0
      easy 1 vs easy 5     1.4Å   5.9      100.0%         93/0
      easy 2 vs easy 3     0.7Å   6.2      100.0%         98/0
      easy 2 vs easy 5     1.8Å   5.9      100.0%         93/0
      easy 3 vs easy 5     1.5Å   5.9      100.0%         92/0
--------------------------------------------------------------------------------
**********FURTHER IMPROVE MODEL_CHECK**********
One weakness of model_check: it gives a high score to an extended alpha-helix. that means it doesn't have a term to favor compactness.
(one hard target model between 350-364....need to find it later)
So a further improvement is to add a compactness term such as gyration,
and add other measures such as energy terms: verify3d, prosa, procheck, skolnick, baker terms and so on?
--------------------------------------------------------------------------------
Interesting observation for this target:
****************************
using BETApro/CMAPpro, no very long range contacts beyond 20 separation are predicted for this simple, helical protein
(all of them are helix-loop-helix motifs). that means it is very difficult to predict long-range tertiary contacts for this kind of
simple-topology protein (supposed to fold very fast according to B. Nolting).
We may try to design a special method to predict the helix orientation or helix contacts for this case.
*****************************************************
T0365: fr
  3Dpro:
    cm finds one match, 1T8B (evalue 0.005), not very significant.
    FR finds three significant matches
      1 1T72A 1.28
      2 1XWMA 1.06
      3 1SUMB 1.02
      4 1VCTA -0.27
      5 1I6ZA -0.3
      6 1HX1B -0.33
    Apparently, the cm model should not be put on top in this case. (cmfr, cm_ab, frcom, fr1, fr2)
    Actually, the cm template, 1T8B, is also a PhoU-like phosphate uptake regulator (Reso=3.23, R=0.216)
    Decision:
      use model 3 (frcom) as model 1.
      use model 1 as model 2.
      keep model 4 and model 5 (fr1, fr2).
      use model 6 (fr3) as model 3.
  FOLDpro:
      1 1SUMB 1.98 (scop: all alpha, fold: spectrin repeat like, super: PhoU-like, fam: PhoU-like) reso=2, r=0.22
      2 1T72A 1.86 (PhoU homolog, reso=2.9, R=0.216)
      3 1XWMA 1.32 (PhoU phosphate uptake regulator, reso=2.5, r=0.243)
      4 1VCTA 0.02 (potassium channel related)
      5 1I6ZA -0.46 (scop: mainly alpha, fold: spectrin repeat like, super: BAG domain) --- fold identification is correct
    current models (frcom, fr1, fr2, fr3, fr4)
    DECISION: keep model 1-4. use the frcom model of 3dpro as model 5.
    Compare frcom of foldpro with frcom of 3dpro:
      Rmsd = 2.1Å Z-Score = 7.0 Sequence identity = 81.5% Aligned/gap positions = 211/8

T0364: easy
  FOLDpro: use easy 1,2 as model 4, 5.
  3dpro: use easy1-3 as model 3-5.
  compare model 1 with easy model 1,2 to see similarity.

T0363: FR
  3Dpro: use model 6 to replace model 5.
  FOLDpro: use model 1 of 3Dpro as model 3 of FOLDpro.

T0362: easy
  FOLDpro: keep model 1 (cmfr) and 2 (cm). use easy1-3 as model 3,4,5
  3Dpro: keep model 1 and model 2. use easy1-3 as model 3, 4, 5.

T0361: hard
  3Dpro: replace model 5 with model 7.
  FOLDpro: unchanged.

T0360: FR
  3Dpro: replace model 5 with model 7 (ab-initio)
  FOLDpro: unchanged.

T0359: easy
  FOLDPRO: use easy 1,2,3,4 as model 2, 3, 4, 5. keep original model 1 (cm)
  3Dpro: use easy 1,2,3,4 as model 2, 3, 4, 5. keep original model 1 (cm)

T0358: hard
  3dpro: unchanged.
  foldpro: use model 2, 1, 6 as model 1, 2, 3

T0357: hard
  FOLDpro:
    replace model 2 with model 6
    replace model 4 with model 7
  3Dpro:
    use model 1 of foldpro as model 1
    use model 1 of 3dpro to replace model 4.
--------------------------------------------------------------------------------
POST ANALYSIS: (weakness of using multiple templates when there is a very significant match)
T0310: AN EXAMPLE WHERE WE USED A LOT OF TEMPLATES, BUT THE BEST SERVER USED ONLY ONE TEMPLATE.
OUR SCORE IS SLIGHTLY LOWER THAN OTHERS ACCORDING TO CE ALIGNMENT.
the evalue of the first template, 1O20A, is e-152, and coverage is also very high. it should be used alone instead of using a lot of templates.
This problem should not happen in the next CASP since we now put a very significant cm match on top instead of combining them.
T0289: an example where CM only finds a match for a portion of the sequence.
In this case, we either use significant FR templates (full length) as FOLDpro did,
or try to add some FR into the cm alignments to cover the full sequence.
In any case, we must combine templates or extend existing CM templates to cover the whole sequence.
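The two post-analysis cases above (T0310 and T0289) suggest a simple decision rule on the top CM hit. The sketch below is only one way to write it down. The function name, the returned labels, and both cutoffs (`evalue_cutoff`, `coverage_cutoff`) are hypothetical; the notes give the e-152 example but no thresholds.

```python
def choose_template_strategy(evalue, coverage,
                             evalue_cutoff=1e-50, coverage_cutoff=0.9):
    """Suggest how to use the top CM template.

    evalue   : e-value of the top CM hit
    coverage : fraction of the query covered by that alignment (0-1)
    Both cutoffs are illustrative assumptions, not tuned values.
    """
    if coverage < coverage_cutoff:
        # T0289-like case: the hit covers only part of the query, so
        # add FR templates or extend the CM template to full length.
        return "combine_or_extend"
    if evalue <= evalue_cutoff:
        # T0310-like case: one very significant, full-coverage hit.
        return "single_template"
    # Full coverage but only moderate significance: combining stays an option.
    return "multi_template"
```

For example, `choose_template_strategy(1e-152, 0.95)` gives `"single_template"`, while the same e-value with a coverage of 0.4 gives `"combine_or_extend"`. A rule like this is only a starting point; the structure comparison between candidate models is still needed to judge individual cases.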
--------------------------------------------------------------------------------
Lesson learned: (A POSSIBLE NEW APPROACH FOR PROTEIN DOMAIN PREDICTION, PARTICULARLY FOR AB-INITIO)
Ab-initio T0347: according to the secondary stx, the first part is alpha/beta, the second part is alpha.
So this should be classified into two domains (most methods classify it into two domains).
our DOMpro also has a signal to classify it into two domains.
FOR AB-INITIO in future:
a) check DOMpro. must trust DOMpro because it is the best ab-initio (especially when prot size > 200)
*******************************************************************************************
b) check the secondary stx prediction to see secondary stx patterns (types)
********************************************************************************************
c) refer to DOMSSEA as well (additional evidence)
d) size > 200: it is possible to have two domains. size < 130: usually should be one domain.
e) for templates with svm score > -0.5, also need to consult FR.
***************************************************************************************
f) check the alignment file. (***************VERY VERY VERY USEFUL**********************)
****************************************************************************************
Actually, for this one FOLDpro finds 1VK1 (svm=-0.25), which is used by many other servers.
according to this target, the number of domains should be 2. we should have stuck to this one.
Of course the domain architecture can be adjusted.
--------------------------------------------------------------------------------
T0356: hard
  large hard protein
  at least it has two domains (maybe three domains: VERY POSSIBLE! third domain from 347 - end?)
  take the front and end parts and submit them to FOLDpro public.
  Later, we may combine them with the first template of 3Dpro.
  FOLDPRO: model 7, 1, 3, 5, 6 as model 1,2,3,4,5
  3Dpro: models human, 7, and 3 of FOLDpro and models 7,8 of 3Dpro as model 1,2,3,4,5.
  DECISION: DOMAIN IS SET TO 3 by checking the secondary stx pattern and the sequence alignment file.
--------------------------------------------------------------------------------
T0355: hard FR (domain hard)
  FINAL DECISION: select human1, human2, human3, human4, human6 for both FOLDpro and 3Dpro.
  (to distinguish them, change the order of model 2 and model 3 for FOLDpro and 3Dpro)
  change domain to single domain.
  psi-blast only finds the c-terminal region (1/3 of total length)
  Strategy:
  FOLDPRO:
    1. submit the first 2/3 of the sequence to public FOLDpro to see if we can identify some significant match.
       then use it to combine with the first half later. running, but not necessary anymore.
    2. generate a stx using all possible alignments in pir, particularly long templates
  3Dpro:
    model 4 (1KA9F) and 5 (1THFD) are better because they use long templates.
    according to sspro (alternating helix and strand) and model 4/5, it is a beta-barrel. now life is easier.
    model 7 is also ok.
    model 8 also tries to make a barrel, but the front region is not well covered.
    model 6 also tries to make a barrel, but the front region is not well aligned.
    model 3 (frcom): tries to make a barrel, but too many conflicts
    model 1: half barrel.
  PROBLEMS:
    the psi-blast alignment is too short even though maybe the whole template can be used
    advanced combination tends to select short fragments from many templates. it should select long fragments.
    so templates should also be ranked by alignment coverage, not just svm scores.
  CHECK FOLDpro (fr models)
    frcom: too many conflicts.
    model 4-7: try to make barrels, but the alignment is not good
    model 8 is better. (1QOPA)
  Analyze the following 12 templates:
    1KA9F: tim beta/alpha barrel (Histidine biosynthesis enzyme family). reso=2.5 model 4 (already generated) (human 2.pdb....)
    1THFD: reso=1.45 (this one should be used), tim beta/alpha barrel (Histidine biosynthesis enzyme family) model 5 (already generated) first model???? (human1.pdb....)
    1QOP: reso=1.4, fold: tim beta/alpha-barrel, Ribulose-phosphate binding barrel (super), tryptophan biosynthesis enzyme family   human4.pir
    1H5Y, reso=2.0, tim barrel, family: histidine biosynthesis enzyme   human3.pir
    1O5KA: tim alpha/beta barrel, super: aldolase, fam: Class I aldolase   human5.pir
    1RD5: reso = 2.02, tryptophan synthase (no scop definition yet)   human6.pir
    1P4C, reso=1.35, fold: tim beta/alpha barrel, super: FMN-linked oxidoreductase, fam: FMN-linked oxidoreductase   human7.pir
    1F6K, reso=1.6, fold: tim beta/alpha barrel, super: aldolase, fam: class I aldolase   human8.pir
    1vhn: reso=1.59, fold: tim beta/alpha barrel, super: FMN-linked oxidoreductase, fam: FMN-linked oxidoreductase   human9.pir
    1QO2, reso=1.85, tim beta/alpha barrel (Histidine biosynthesis enzyme family)   humana.pir
    1HL2: reso=1.8, 1F6K, reso=1.6, fold: tim beta/alpha barrel, super: aldolase, fam: class I aldolase   humanb.pir
    1GOX: reso=2, 1vhn: reso=1.59, fold: tim beta/alpha barrel, super: FMN-linked oxidoreductase, fam: FMN-linked oxidoreductase   humanc.pir

T0354: hard
  FOLDpro: replace model 2 with model 6.
  3Dpro:
    use model 8 as model 1
    use model 7 as model 5

T0353: hard
  3Dpro: MODEL 1 IS THE SAME AS FOLDPRO. leave it.
  FOLDpro: DECISION: NO CLEAR PREFERENCE. LEAVE IT.
  comparisons:
    pair                  Rmsd   Z-Score  Seq. identity  Aligned/gap
    model 1 vs model 2    3.1Å   4.4      37.7%          69/4
    model 1 vs model 3    2.9Å   4.6      39.2%          74/8
    model 1 vs model 4    2.5Å   4.1      21.1%          71/16
    model 1 vs model 5    3.8Å   3.9      39.4%          71/18
    model 5 vs model 2    2.2Å   4.9      100.0%         70/0
    model 5 vs model 3    5.3Å   4.6      93.6%          78/6
    model 5 vs model 4    2.7Å   4.6      100.0%         76/0
    model 2 vs model 3    2.1Å   5.0      100.0%         70/0
    model 2 vs model 4    1.7Å   4.9      82.6%          69/2
    model 3 vs model 4    4.7Å   4.7      86.6%          82/2
  so all five models are very similar; models 2,3,4,5 are especially similar.
--------------------------------------------------------------------------------
EVALUATION OF T0313
Using a lot of templates in a very significant case is not always bad:
T0313: 3Dpro score is 79, very good. FOLDpro is 80 (highest so far).
SO SOMETIMES WE STILL NEED TO PUT THE MULTIPLE-TEMPLATE CM ON TOP, USING CE COMPARISON
(COMPARE CM-MULTI AGAINST EASIEST, CM1, CM2...)
FOR THE LAST ONE THIRD OF THE COMPETITION, WE MUST BE CAREFUL TO DECIDE WHEN TO USE MULTI-TEMPLATE AND WHEN NOT
(JUDGE CASE BY CASE, CAREFULLY DO STX COMPARISON)
--------------------------------------------------------------------------------
T0352: hard
  3Dpro: no change
  FOLDpro:
    no model is good.
    use model 6 to replace model 1 according to visual inspection and verify3d score.
model_check score of model 6 is 5 points lower (18).
use model 1 to replace model 5

T0351: hard
3Dpro:
replace model 1 using model 6 (ab initio, because it is a short protein)
replace model 3 using model 1
replace model 4 using model 8
FOLDPRO: try to get rid of models 2 and 4.
ab-initio predicts two domains, but the protein is very small????
compare model 1 and 7: Rmsd = 3.5Å Z-Score = 4.1 Sequence identity = 91.1% Aligned/gap positions = 79/34
compare model 1 and 3: Rmsd = 4.6Å Z-Score = 4.7 Sequence identity = 84.5% Aligned/gap positions = 97/5
compare model 1 and 5: Rmsd = 4.1Å Z-Score = 4.4 Sequence identity = 81.3% Aligned/gap positions = 91/29
compare model 1 and 2: Rmsd = 7.6Å Z-Score = 2.0 Sequence identity = 23.4% Aligned/gap positions = 64/52
compare model 3 and 5: Rmsd = 4.8Å Z-Score = 3.9 Sequence identity = 55.7% Aligned/gap positions = 88/30
compare model 3 and 7: Rmsd = 6.4Å Z-Score = 3.3 Sequence identity = 68.8% Aligned/gap positions = 80/47
compare model 5 and 7: Rmsd = 4.2Å Z-Score = 4.4 Sequence identity = 95.2% Aligned/gap positions = 84/15
******FINAL DECISION********
use model 6 as model 2
use model 7 as model 4
domain: single domain because of the length limit, unless strong evidence favors two domains (< 130 residues: single domain).
****************STILL NEED TO THINK HARD ABOUT THIS CASE. THE FR MODEL ALSO PREDICTS TWO DOMAINS, BUT CAN 30-40 RESIDUES BE TREATED AS ONE DOMAIN?????????????????

T0350: hard
no change to make.
------------------------------------------------------------------------
T0349: hard
3Dpro: run 1: model 5 is very bad. use model 6 to replace model 5. wait for the second run to finish.
replace model 4 with model 1 of foldpro. done.
FOLDpro: both runs finished. model 6 looks pretty good.
use model 6 as model 1
use model 1 as model 2
also need to use 3Dpro (model 2 by 1ZHV: a different alignment using the same template can generate better models!!!!)
use model 2 of 3Dpro as model 4

T0348: easy ?
3Dpro: found one match: 1PFT, which is also a short protein. the generated stx is not very compact though.
SCOP: SMALL PROTEIN, ZINC BETA-RIBBON, transcriptional factor domain
CATH: mainly beta, single sheet. It is an NMR stx.

T0347: hard target
FOLDpro: the protein is classified into two domains by fr. Should we use the ab-initio one-domain prediction because the FR model is not good?
(POST THINKING AFTER PREDICTIONS ARE PUBLISHED ON THE WEB: svm score = -0.25, should at least use FR as guide)
check the dompro raw file: it does predict a "TTT" at positions 137 and 162 respectively.
DECISION: USE DOMPRO SINGLE DOMAIN PREDICTION because the FR prediction is not good at all.
------------------------------------------------------------------------
T0346: easy target
3Dpro: use easy models 1-3 as models 3, 4, 5. keep original model 1 and model 2.
FOLDpro:
use easiest as model 1
cm model as model 3
easy model 1-2 as model 4-5
compare model with easiest: Rmsd = 0.9Å Z-Score = 7.1 Sequence identity = 100.0% Aligned/gap positions = 172/0
compare easy 1 with easiest: Rmsd = 1.3Å Z-Score = 6.6 Sequence identity = 99.4% Aligned/gap positions = 164/16
compare easy 2 with easiest: Rmsd = 1.3Å Z-Score = 6.6 Sequence identity = 99.4% Aligned/gap positions = 164/16
compare model 2 with easiest: Rmsd = 0.8Å Z-Score = 6.9 Sequence identity = 95.2% Aligned/gap positions = 166/5

T0345: easy target
3Dpro: use easy 1-4 as model 1-4. use original model 1 as model 5.
compare easy model 1 with cm model 1: Rmsd = 0.2Å Z-Score = 7.3 Sequence identity = 100.0% Aligned/gap positions = 182/0
FOLDpro: use easy 1-4 as model 1-4. use original model 1 as model 5.

T0344: hard target.
3Dpro: use model 6 to replace model 2
FOLDpro: exchange model 1 and model 2. resubmit model 5 due to model loss.
------------------------------------------------------------------------
T0343: hard target (alpha and beta protein, sheet is buried).
3Dpro: use model 4 as model 1, use model 7 as model 4, use model 8 as model 5.
FOLDpro: exchange model 3 and 1. use model 7 as model 2.

T0342: hard target (NEED TO VERIFY LATER IF IT SHOULD BE CUT INTO TWO DOMAINS. 50 RESIDUES OF ALPHA HELIX??)
FOLDpro: one template, 2G0QA, is found, but it only covers one fragment. cm only finds one domain, so domain combination is hard. maybe use fr or use the full length of the original template?
according to the secondary stx prediction of the last 50 residues, the template has a strand, a loop, and one long helix. the target has three helices. so the second part of the template doesn't match the target significantly?
1) now make a human alignment to use the full length of the sequence. human model is done. predicted gdt score: 44.
2) submit the second part to the public foldpro server.
3Dpro: cm found nothing, but fr found the following templates:
1 1VKBA 1.08
2 2G0QA 1.04 (same as found by cm of foldpro)
3 1XHSA 0.99
4 1V30A 0.81
DECISION:
FOLDPRO: USE MODEL 1 TO REPLACE MODEL 2. USE MODEL 7 TO REPLACE MODEL 1.
DOMAIN: undecided yet. (leave it alone? one domain?)
compare model 1 and model 7: Rmsd = 1.7Å Z-Score = 5.3 Sequence identity = 77.0% Aligned/gap positions = 100/30
The last 50 residues are not well aligned, as we expected.
3Dpro:
compare 3Dpro model 1 with model 1 of foldpro: Rmsd = 2.7Å Z-Score = 5.3 Sequence identity = 79.4% Aligned/gap positions = 107/18
compare 3dpro model 1 with model 7 of foldpro: Rmsd = 2.9Å Z-Score = 6.0 Sequence identity = 91.6% Aligned/gap positions = 131/16

T0341: easy
3Dpro: use easy models 1, 2, 3 as models 3, 4, 5.
FOLDPRO: the dangling tail region is assigned to domain 2 (should be domain 1; a bug in the parsing perl script, to fix). the first segment of domain 1 is too short. manually adjust it.
use easy models 1, 2, 3 as models 3, 4, 5.
compare cmfr model with easy model 1: Rmsd = 0.6Å Z-Score = 7.6 Sequence identity = 99.2% Aligned/gap positions = 245/4

T0340: easy
FOLDpro: use easy models 1-4 as models 1-4. use original model 1 as model 5.
3Dpro: use easy models 1-4 as models 2-5.

T0339: easy target
FOLDPRO: domain is hard to determine (2 or 3 domains??)
CATH: 2 DOMAINS. SCOP: 1 domain. So let's do a two-domain prediction.
domain 1: 1-13/288-end. domain 2: other.
use easy models 1-4 as models 1-4
use original model 1 as model 5
3DPRO: use easy models 1 to 4 as models 2-5.

T0338: easy target
3Dpro templates:
1OKVB: two domains (two orthogonal bundles)
1AIS: two domains (two orthogonal bundles)
So this protein is a two-domain protein. the front end and back end may be disordered.
PDP parses non-continuous domains and classifies the residues at the front and back ends into other domains? how do we handle this???
The real bug is in our parsing scripts: pdb classify two domains: 146-246 and 22-145. so we assign domain 1 to 146-246 and domain 2 to 22-145, but the front end is assigned to domain 1. that is why we see crossing. If the domain parser first orders the domain segments, this problem can be avoided.
************************VERY IMPORTANT**********************************
THIS IS A BUG IN parse_domain.pl. WE NEED TO FIX THIS BUG LATER. FOR NOW, WE NEED TO MANUALLY VERIFY DOMAIN PARSING USING THE PDP PROGRAM AND FIX PROBLEMS IF THE BUG HAPPENS. I THINK THIS PROBLEM HAS HAPPENED BEFORE; AT THAT TIME WE DIDN'T KNOW THE REASON AND DIDN'T IDENTIFY THE PROBLEM.
************************************************************************
DECISION:
3DPRO: REPLACE MODEL 4 WITH EASY MODEL 2.
FOLDPRO: change the domain model to let the front/back ends belong to domain 1 and 2 respectively. use easy model 2 of 3dpro to replace model 4 of foldpro.
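The fix suggested for the parse_domain.pl bug under T0338 (order the domain segments first, then assign the unparsed ends) can be sketched as follows. This is a minimal illustration in Python, not the actual Perl script; the function names, the nearest-segment rule for dangling residues, and the chain length of 256 used in the usage note are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # inclusive residue range (start, end)


def renumber_domains(domains: List[List[Segment]]) -> List[List[Segment]]:
    """Order domains by their first residue so that domain 1 is N-terminal."""
    cleaned = [sorted(segs) for segs in domains if segs]
    return sorted(cleaned, key=lambda segs: segs[0][0])


def assign_residues(domains: List[List[Segment]], length: int) -> Dict[int, int]:
    """Map every residue 1..length to a domain number (1-based).

    Residues not covered by any parsed segment (dangling ends, linkers)
    are given to the domain whose nearest segment boundary is closest in
    sequence. This nearest-segment rule is an assumption of the sketch.
    """
    ordered = renumber_domains(domains)
    labels: Dict[int, int] = {}
    for res in range(1, length + 1):
        best_dom, best_dist = 0, None
        for dom_no, segs in enumerate(ordered, start=1):
            for start, end in segs:
                if start <= res <= end:
                    dist = 0
                elif res < start:
                    dist = start - res
                else:
                    dist = res - end
                if best_dist is None or dist < best_dist:
                    best_dom, best_dist = dom_no, dist
        labels[res] = best_dom
    return labels
```

With the T0338 parse (146-246 reported first, 22-145 second) and a hypothetical length of 256, `assign_residues([[(146, 246)], [(22, 145)]], 256)` puts residues 1-145 in domain 1 and 146-256 in domain 2, so the front and back ends no longer cross.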
------------------------------------------------------------------------
T0337: easy (CANCELED, EARLY RELEASE)
3Dpro: 2 domains (alpha domain and a+b domain)
anyway: replace models 3, 4, 5 with easy models 1, 2, 3. (to do)

T0336: hard target (CANCELED, EARLY RELEASE)
3Dpro: found several significant matches with similar folds.
1OYZ: one domain (both CATH AND SCOP), repeat alpha.
1Q1S: same
1EE4A: same
2BPT: not classified yet, but similar
1XM9: armadillo repeat domain
1TE4: same
1JDH: same
foldpro: replace model 2 with model 7 (to do)
the top template, 1W63Z, is not really good. it has lower resolution and is in a different fold according to SCOP. Should we remove it and regenerate the model????
compare current model 1 of foldpro and 3dpro: Rmsd = 3.7Å Z-Score = 6.6 Sequence identity = 91.1% Aligned/gap positions = 214/32
remove 1W632 to regenerate a model and compare it to model 1 of 3dpro.
a new model is generated: model check score: 0.37, slightly lower than model 1 (40).
compare it to 3dpro model 1: Rmsd = 3.4Å Z-Score = 6.6 Sequence identity = 82.2% Aligned/gap positions = 213/48
DECISION: USE THE HUMAN MODEL TO REPLACE BAD MODEL 2 and LEAVE MODEL 1 AS IT IS.
------------------------------------------------------------------------
T0335: hard target (pretty hard)
3Dpro: replace model 5 with model 8
Foldpro: exchange model 1 and model 4. replace model 3 with model 7. resubmit model 5 due to error.
------------------------------------------------------------------------
T0334
psi-blast found two very significant templates: 2AQJ and 2ARD. evalue of both is 0.
2AQJ has better resolution (1.8 vs. 2.6), higher identity rate (0.54 vs. 0.53), higher positive rate (0.73 vs. 0.7), and lower gap rate (0.02 vs. 0.05).
to do: check the structure of both.
We should use 2AQJ as model 1. Since the gap is very small, we should not use cmfr.
Model 1: 2AQJ
Model 2: 2ARD
Model 3: combination of them
Model 4: others.
We probably don't need to use FR at all.
Use CE to compare the two templates: Rmsd = 0.6Å Z-Score = 8.1 Sequence identity = 100.0% Aligned/gap positions = 517/0
They are almost exactly the same. So, just use 2AQJ as the template for model 1. 2ARD has one more small gap (probably due to disorder or high b-value?).
According to visual inspection, it looks like an a+b single-domain protein. pdp and pdb also classify the protein into one domain.
3Dpro: easiest is put on the top.
compare model 1 (easiest) and model 2 (cmfr): Rmsd = 0.4Å Z-Score = 8.4 Sequence identity = 99.8% Aligned/gap positions = 517/16
compare model 1 with model 3 (cm): Rmsd = 0.4Å Z-Score = 8.4 Sequence identity = 99.8% Aligned/gap positions = 518/14
compare model 1 with easy 1: Rmsd = 0.6Å Z-Score = 8.4 Sequence identity = 99.8% Aligned/gap positions = 519/12
DECISION: replace models 4, 5 with easy models 1 and 2.
FOLDPRO: replace models 4, 5 with easy models 1 and 2.
************************************************************************
A NEW START (SINCE JUNE 16, 2006, SECOND HALF OF CASP7)
************************************************************************
VERY HARD RULE: for a template with evalue < -100 or -120 or -150 and identity rate > 0.5, only the top 1 template should be used, as long as it has good resolution.

HALF CASP MILESTONE
NEW FEATURES (HALF CASP IS GONE)
June 16, 2006.
Now I have implemented protein stx modeling based on a CM model that uses only the single top-ranked template, with fragments supplemented from other templates if possible. So, in future, we will use these models to replace many bad models in regular generation. Especially for very significant templates (e < -90 and cover > 0.9 or 0.85), we should use the top single template as the best model.
I also adjusted the cm and fr options (combinations) to reduce the max linker size to 3 or 5 for both foldpro and 3dpro. (effective since target T0334)
I also adjusted the e-value difference for significant combination from 10 to 5.
Thus we will combine fewer, but closer, templates in future.
We still need to be very careful about dangling regions (> 25 residues). we need to either extend the alignments on the same template or drag fragments from other templates. (so we still need some human intervention if necessary)
another lesson: the model check score is good at discriminating good models from bad models (score diff > 15), but it is hard to discriminate the best models from good models (score diff < 10). So model check is still reliable, but don't expect it to always rank the best models on the top.
Lessons:
Using more templates (when there are very significant matches) is not always good (T0291).
Having a large dangling region (due to a less complete local alignment of psi-blast) can cost a lot of gdt-ts points (T0293).
For ranking now, we need some human intervention. At the same time refer to the model check scores, e-value, svm-score, template resolution, and visual inspection.

T0291: blast info: temp_name, length, score, evalue, align length, identity rate, positive rate, gap rate
top 2 (two chains of the same protein): 1JPAA: evalue is -153, cover rate: 0.89, identity rate: 0.74
no. 3: 2SRC, evalue: -143, align length=285, identity rate=0.42.
The difference in evalue and identity rate is very large, so we should not use no. 3 and below.
************************************************************************
IMPORTANT IDEA: TO DO THE BEST IN EASY COMPARATIVE MODELING, WE NEED TO ADD ONE EXTRA LAYER TO THE PIPELINE. WE USE BLASTP TO SEARCH THE DATABASE WITHOUT PROFILES TO IDENTIFY VERY EASY TEMPLATES. IF THE COVERAGE > 0.85, THERE ARE NO VERY BIG GAPS (> 20 RESIDUES), THE RESOLUTION < 2.5 AND THE EVALUE < -90, WE GENERATE A MODEL AND PUT IT ON THE TOP. THIS MODEL IS USUALLY THE BEST MODEL.
THUS, OUR PIPELINE WILL HAVE FOUR LAYERS: BLAST, PSI-BLAST, FOLDPRO, AB-INITIO.
************************************************************************
Let's do it now. done.
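The extra BLAST layer described above amounts to a simple filter on each hit. A minimal sketch, assuming the hits are already parsed into records: the BlastHit fields and function names are hypothetical and are not taken from easy_main.pl. E-values are stored the way these notes write them, as the base-10 exponent (-134 means 1e-134).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlastHit:
    template: str       # PDB id of the template
    evalue_exp: float   # base-10 exponent of the e-value (-134 means 1e-134)
    coverage: float     # fraction of the target covered by the alignment
    max_gap: int        # longest unaligned stretch of the target, in residues
    resolution: float   # template resolution in angstroms


def is_easy_template(hit: BlastHit,
                     max_evalue_exp: float = -90.0,
                     min_coverage: float = 0.85,
                     max_gap: int = 20,
                     max_resolution: float = 2.5) -> bool:
    """Apply the thresholds stated in the notes; all four must hold."""
    return (hit.evalue_exp < max_evalue_exp
            and hit.coverage > min_coverage
            and hit.max_gap <= max_gap
            and hit.resolution < max_resolution)


def pick_easiest(hits: List[BlastHit]) -> Optional[BlastHit]:
    """Return the most significant hit that passes the filter, or None."""
    easy = [h for h in hits if is_easy_template(h)]
    return min(easy, key=lambda h: h.evalue_exp) if easy else None
```

If `pick_easiest` returns None, no "easiest" model is generated and the target falls through to the PSI-BLAST, FOLDpro and ab-initio layers. The thresholds are keyword arguments so that the e-value cutoff can be tightened without touching the logic.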
for T0291, blast easily finds 1JPA. 1JPA resolution is 1.91, evalue: -134, ind=0.74, cover ratio=0.87 (actually, the residues in the uncovered area are not evaluated because they are disordered; coordinates are missing).
for the model generated from the blast alignment, we got a score of 88, better than the 78 of the combination and close to the best, 91.
For the model and alignments: see /home/jianlinc/eval_casp7/easy
now add easy_main.pl to the web server and test.
for T0290: evalue of blast is only -82, so no easiest model is generated. but this model is still pretty good. the later psi-blast model is also pretty good.
for T0293: no significant templates found by blast
for T0295: found: e=-117, ratio: 0.99, resolution = 1.9
295 is an interesting example: frcom (score is 82) is better than the first model (cm, score is 74), and psi-blast is using the same template as the one found by blast. it is interesting to compare their scores. blast score is 75.64, psi-blast is 74.5.
so set the final evalue to -100.
------------------------------------------------------------------------
T0333
very easy cm. generate a cm model using only the top template. (running on mine4). done.
both 3dpro and foldpro: use easy models 1-4 to replace models 2-5. for the top, I still use the multiple templates so far.
for foldpro: make one try: the top 1 model (template resolution is 1.8), the no. 2 model (template resolution is 2.8), so the combination model (cm.pir) may not be as good as the top 1. So decide to exchange model 1 (cm model) and model 2 (easy model 1). (leave 3dpro unchanged for comparison later).
compare model 1 and model 2 (very similar): Rmsd = 2.2Å Z-Score = 7.0 Sequence identity = 80.9% Aligned/gap positions = 335/82

T0332: easy target
foldpro: replace model 3 with model 6. done.
------------------------------------------------------------------------
NEW IDEA
FOR VERY EASY TARGETS, WE DON'T NEED FOLDPRO. JUST USE CM.
BUT WE ARE GOING TO GENERATE MORE MODELS USING THE TEMPLATES IDENTIFIED BY PSI-BLAST: ONE COMBINATION, THEN MODELS USING THE TOP RANKED TEMPLATES RESPECTIVELY. (NEXT CASP).
FOR CASP7, WE USE MANUAL TWEAKING. GENERATE MODELS FOR TOP RANKED TEMPLATES. APPLIED ON TARGETS SINCE T0332. 6/15/2006.
------------------------------------------------------------------------
T0331
FOLDpro: exchange model 1 (frcom) and model 2 because model 1 has some knots. (to do)
3Dpro: use models 2, 3, 5 to replace model 1 and two other bad models. (to do)

T0330
resubmit model 1 due to error. done.
T0330TS137_1 PIN_336812_18259 1127-6715-8809 06/14/06 17:07:03 pfbaldi@ics.uci.edu
------------------------------------------------------------------------
Two post-analysis problems:
(a) T0293: the dangling region is not aligned, so the score is lower. we need to fill in the short psi-blast alignment manually.
(b) multiple templates don't necessarily help improve accuracy. so we need to restrict the number of templates by changing the evalue of cm (to -5)? to do????
************************************************************************
MIDDLE-TERM LESSON: VERY VERY VERY IMPORTANT LESSONS:
In the future, IF WE SEE UNALIGNED REGIONS IN CM, WE NEED TO PICK SIGNIFICANT FR ALIGNMENTS TO FIX THE HOLE MANUALLY BY CHECKING THE XXX.FR.PIR FILE. THEN USE THE SIMPLE COMBINATION TO COMBINE IT WITH CM, OR TAKE THE MISSING PORTION TO FIX THE HOLE. A 40-RESIDUE HOLE CAN COST US 15 GDT-TS POINTS. T0293 IS THE LESSON.
ALSO, WE CAN SIMPLY TAKE THE TEMPLATES FOUND BY CM AND EXTEND THEM (OR CHECK THEIR ALIGNMENT IN FR) TO FILL THE GAP. IN ANY CASE, I DON'T ALLOW LONG DANGLING REGIONS.
************************************************************************
------------------------------------------------------------------------
T0330: easy target
3dpro: template 2GFH, two domains: 1-108, 108 to end.
1jud: two domains according to cath, 1 domain according to SCOP.
Surprisingly, the same templates (1jud and 2gfh) are chosen for T0329 as well?
------------------------------------------------------------------------
T0329: easy target
foldpro: template 1JUD. scop classifies it into one domain, cath into 2 domains. visual inspection -> 2 domains. This is a two-domain protein.
FR finds a lot of templates, but the templates on top seem to match only one domain of T0329. Thus the predicted GDT-TS score is only half that of the cm model. This also reflects the problem of global alignment in stx generation (for a wrong domain architecture, the alignment can't be correct). but (in Edgar's Current Opinion Stx Bio, June/July, 2006, he mentioned a paper that can handle non-matched domain architecture in alignment. maybe we want to try that method).

T0328: easy target
3Dpro: single domain according to visual inspection and PDB. The template of this protein is very new (released in May, 2006).
model 3 is a bad model, but we have a very good model on the top, and models 6-8 are also very bad.
final decision: replace model 3 with model 6.
------------------------------------------------------------------------
T0327: hard target
3Dpro ranks 1XMA top 1, foldpro ranks it 2nd. considering the model check score too, for foldpro we need to rank 1XMA on the top (exchange model 1 and 2).
------------------------------------------------------------------------
T0326: easy target
3Dpro: template: 2GHR
visually inspecting the cm model, it is a single-domain protein, but the domain parser cuts it into two domains. PDB also classifies it as a 1-domain protein.
Decision: 3Dpro: exchange model 1 and 2 according to model check score and visual inspection (exchange cmfr and cm_ab).
------------------------------------------------------------------------
STARTING FROM TARGET 323, WE ONLY COMPARE RESULTS OF FOLDPRO AND 3DPRO WITHOUT BACKUP. FOR OTHER SERVERS, WE ONLY CHECK THE EXISTENCE.
------------------------------------------------------------------------
Lesson: (overcut on T0321 hard target?)
domain prediction for T0321 (hard target?). since it is a hard target (score = -0.2?), the fr stx is not very compact and confident. the domain parser overcuts the protein into two domains. In this case we need to consult dompro, meta domain, and the pdb template to make the final decision.
The lesson is that for a hard target, we also need to carefully check the domain prediction and check whether the domain parser makes a reasonable cut.
another lesson learned from T0319 is that the single-domain limit should be set to 140. for all targets shorter than 140 residues, the domain number is set to 1 anyway.
------------------------------------------------------------------------
To do:
1. add model check to the online web server script to generate gdt-ts scores for each model. done.
2. disable the automatic sending of the domain model of foldpro? (yes) done.
3. disable the automatic sending of the 3d models of foldpro? (no)
Starting next week, we are going to use model_check as the main tool and reference to rank models, especially for models with negative scores. Visual inspection is used to remove apparently bad, very loose models.
------------------------------------------------------------------------
T0325: hard target
3Dpro finds two positive templates. the top 1 (1V6T) generates a TIM barrel stx. According to SSpro, the target does have intervening helices and strands.
but in the middle (some region), only helices are predicted; there is no strand. Of course, there are loops in the middle. so the ss evidence is not very obvious. According to SA, there are intervening buried and exposed fragments. So it could be a TIM barrel. But there is a knot, probably generated due to a large loop.
Now regenerate models using fr1.pir (to see if the knot disappears). there is still a knot. but I might use the new human model to replace model 1. (need to compare them to make sure they are very similar).
FINAL DECISION: leave 3Dpro as it is.
FOLDpro: puts 1p49 and 1v6t as top 2. we need to make 1v6t model 1 according to the consensus of 3dpro and foldpro.
decision: exchange model 1 and 2. also regenerate the domain model.
Apply model check on the stx of 1v6t: score: 44 (3dpro); stx of 1p49: score is less than 20. for ab-initio, the score is 40 (so the models generated by ab-initio usually have high scores, probably due to enforcement of SS, but that doesn't mean they have very high gdt scores. it is a bias. We can train a model to predict the gdt-ts score for ab-initio models only).

T0324: easy target
foldpro: two domains. domain 1 (1-15, 82-end), domain 2 (16-81).
3dpro: model 2 (frcom) is bad, models 3 and 4 are bad. leave as it is.
This target is one example where fr yields a different second domain from cm. fr predicts two aba domains. cm predicts one aba, one helix. according to the sspro secondary stx prediction, cm is right. for a two-domain protein, this problem can happen with fr because fr uses global alignment. if only one of the two domains matches, it will use both domains anyway.

T0323: easy target.
3Dpro: template: domain 2 of 1MPG (SCOP: single domain; CATH classifies it as two orthogonal bundles?)
maybe it is really two domains: 1-150 (domain 1), remainder (domain 2), or (31-150, 1-30/151-end).
the domain is really hard to determine for this protein. I prefer two domains currently.
FOLDpro: replace model 4 with model 6.
for domain (the last dangling region is classified into domain 1 again: 1 - 2 - 1 - 2). change it to 1-2-1 format. (this is a problem of the domain parser; we need to modify the script to handle this).

T0322: easy target.
3Dpro: cm_ab (model 2) looks better than cmfr. should we exchange them??? the first 18 residues of cmfr are not well predicted (a stick, causing three clashes).
compare model 1 and model 2: rmsd=0.9A, z=6.7, ind=100%, aligned/gap = 139.
decision: exchange model 1 and model 2.

T0321: hard target
leave as it is.

T0320: easy target
foldpro: 1sur can cover the first two-thirds of the target; 1zun can cover the first two-thirds of the target (the last part of 1zun has no structural info). so we need a combination from another source to fill this gap. The covered part is a single domain.
submit the second part to foldpro: public. done. (maybe some positive will come out)
This protein may be a two-domain protein.
according to the cmfr model of 3dpro: domain 1: 1-230, domain 2: remainder.
according to foldpro-public: the second domain is an ab-initio domain.
------------------------------------------------------------------------
T0319: hard target
DOMpro predicts two domains even though the protein is small, but I will stick to it.
to do: use model check to evaluate and rank models.
FOLDpro: use model 5 as model 1 (according to model check score and visual inspection)
use model 1 as model 2
use model 6 as model 5
done.
Changed my mind. It is more like a single-domain protein. done. (note: the meta server also predicts a single domain)

T0318: large protein. from 80 to end has a match (multi-domain)
template 1LAM: two domains.
scop: domain 1 (1-159): Macro domain-like, domain 2 (160-484): Zn-dependent exopeptidases
cath: domain 1: aba, Leucine Aminopeptidase, domain 2: aba, Amino peptidase, zn.
human prediction: domain 1: 1-166 (the first 57 residues are not aligned), domain 2: 167-end (aba domain)
Human:
FOLDpro: two significant templates: 2EWBA and 1LAMA. the first 31 residues are not used.
so we can manually add these residues to the cm alignment to generate a more complete stx. Running................ the new stx is good. domain 1 is improved. However, there is a knot. we need to adjust it.
now, make a slight adjustment to the alignment file and regenerate as human2. now human2 doesn't have a knot.
3Dpro: the cm model has domain 1 mostly correct, but one strand is missing from the beta-sheet. the alignment of one template is different from FOLDpro; that is why the stx of domain 1 is somewhat different.
compare the human model (human 1) of foldpro and cm model of 3dpro: rmsd=2.9, z=7.8, ind=85, aligned/gap = 450/51.
compare the cm model of foldpro and cm model of 3dpro: rmsd=2.7, z=7.8, ind = 91.7, aligned/gap = 444/38
Looks like the cm model of foldpro is closer to the cm model of 3dpro. Should we use the human prediction of foldpro?
compare human model (human 2) of foldpro and cm model of 3dpro: rmsd=2.5, z=7.7, ind=86, aligned/gap=450/49.
compare human 1 and human 2 of foldpro: rmsd=1.5A, z=8.2, ind=100, aligned/gap=483.
compare human 2 of foldpro with cm of foldpro: rmsd=1.4, z=8.0, ind=99.8, aligned = 430/10 (the first 50 residues are not aligned). so the quality of the human model is better than the cm model of foldpro.
3dpro human: try to use only one template in 3dpro (the longest one) to regenerate a model to see. (but its resolution, 2.5, is worse than that of the other templates, 1.5).
compare 3dpro human (single template) with cm of 3dpro: rmsd=3.3, z=7.1, ind=74.3, aligned/gap=439/76. its quality is not as good as the cm model.
Decision
FOLDpro: use human 2 as model 1. the cm model can be used to replace other bad models.
3Dpro: cm model as model 1; the human model can be used to replace other bad models.
some fr models are pretty good. (DALI alignment, not CE alignment as above)
FOLDpro: fr based on 1LAM (model 4) and model 5 have a better domain 1.
compare human1 of foldpro with model 4
domain 2: from 167 to the end is completely aligned.
domain 1: from 100 to 166, there is a one-residue shift. from 15-100, not well aligned, but there is still stx similarity.
rmsd=1.8, aligned residues=462, z-score=57, ind=68.
compare human 2 of foldpro with model 4
domain 2: aligned=323 (almost completely aligned), rmsd=1.9, ind=97, aligned=323, z=57.
domain 1: aligned=138, rmsd=2.4, ind=4, z=18.8 (there is some shifting)
compare human 2 with model 5 of foldpro:
domain 2: 170-end, z=52, aligned=324, rmsd=1.4, ind=94
domain 1: aligned=90, rmsd=3.3, ind=6, z=6.4.
compare human 2 with model 1 of foldpro (cmfr): from 50 to end is almost completely the same: rmsd=2.0, z=62, aligned=449, ind=94. the front end has some similarity.
compare human 2 with model 2 of foldpro (cm): rmsd=2.1, ind=98, aligned=449, z=63, from 40 to end is completely aligned.
compare human 2 with model 3 of foldpro (frcom): rmsd=2.7, z=54, aligned=438, ind=71 (from 180 to end is completely the same), first domain is not well aligned. frcom has a small knot (not very serious)
compare model 4 with model 5:
domain 2: from 165 to end: rmsd=1.9, ind=97, z=53
domain 1: z=8.1, aligned=120 (from 40-167), rmsd=3.0, ind=13.
compare model 4 of 3dpro and model 4 of foldpro: rmsd=1.4, z-score=59, aligned=470, ind=90.
compare model 4 of 3dpro with human model of foldpro: the second domain is completely the same.
FINAL DECISION:
1. 3dpro: not changed.
2. use model 4 of 3dpro to replace model 1 of foldpro. The key reason is to improve domain 1, since the second domain is almost the same.
post analysis:
compare 3dpro model 1 with sp3: rmsd=2.5, z=7.8, ind=76.7, aligned=447
compare foldpro model 1 with sp3: rmsd=1.1, z=8.0, ind=77.2, aligned=464/26.
------------------------------------------------------------------------
T0317: easy target
CDD templates: 1VHR, alpha-beta protein, single domain.
FOLDPRO: model 1 is excellent. model 2 (frcom) and model 3 are bad, but there are no other good models to replace them. just leave it.
*****************************LINUX COMMAND****************
If you start a long-running task and forget to add the ampersand, you can still swap that task into the background. Instead of pressing ctrl-C (to terminate the foreground task) and then restarting it in the background, just press ctrl-Z after the command starts, type bg, and press enter. You'll get your prompt back and be able to continue with other work. Use the fg command to bring a background task to the foreground.
**************************************************************
T0316: hard target in terms of domains
cm finds a match for 1-270, but no match for the region 271-440. to see if fr can find the second region.
the template 1VL2 is about 421 residues long and has two domains. Should both domains be used for this target? Or only the first domain, as psi-blast did?
Looks like this protein is two-domain. both domains of 1VL2 are alpha-beta according to PDB.
the domain stx of 1VL2 is complex: 1-170/380-421 -> domain 1. 172-370 -> domain 2.
according to SSpro, the first half is an alpha-beta domain, and the second half is a completely beta-sheet domain. so psi-blast is probably right: only the first domain of 1VL2 should be used.
************************************************************************
NEW TRICK TO HANDLE MULTIPLE DOMAINS
trick: submit domain 2 separately to foldpro. If some positive templates are found, we will combine them with the cm templates. done.
if it is ok, we will use ~/jianlinc/modeller.sh to generate stx from the combined alignments later.
results from FOLDPRO (public): no significant match, but 1CQAA can be used??
Rank  Name   Score
1     1CQAA  -0.52
2     2FHXA  -0.82
3     1O9YA  -0.82
4     1D7YA  -0.85
5     1WH0A  -0.86
6     1LFOA  -0.89
7     1G5UA  -0.9
8     1MK0A  -0.9
9     1ACFA  -0.94
10    1F2KA  -0.94
************************************************************************
3Dpro identifies:
1B37A: two domains: two FAD/NAD-binding domains (polyamine oxidase)
1RSG: 3 domains
1VBK: two domains
1S3E: two domains: FAD/NAD-binding domain, FAD-linked reductase C-terminal domain
2BXR: large, more than 500 residues?
1C0P: two domains: nucleotide-binding domain (oxidase), FAD-linked reductase C-terminal domain.
1Q15: two domains: adenine nucleotide hydrolase-like, Ntn hydrolase-like
1O5W: two domains: FAD/NAD-binding domain, FAD-linked reductases, C-terminal domain
1RU8: two domains: both are adenine nucleotide alpha hydrolase-like
according to FR: some classify them as FAD/NAD, some as adenine hydrolase, some as others. not consistent at all?
3Dpro cm:
1VL2: two domains: adenine nucleotide alpha hydrolase-like, argininosuccinate synthetase. only the first domain is used.
1gpm: three domains, only domain 2 (central) is used: adenine nucleotide alpha hydrolases-like (it is an aba architecture according to CATH).
foldpro cm:
1k92: two domains: adenine nucleotide alpha hydrolase-like, argininosuccinate C-terminal domain
1VL2
foldpro cm and 3dpro cm are consistent.
foldpro cmfr takes a portion from 1N4W for the second domain (from 331 to 445 of 1N4W). unfortunately, the portion of 1N4W is not used successfully (probably dropped by modeller), so the second part is just a stick. try to regenerate cmfr to see what happens.
frcom: is completely coil.
Human intervention (foldpro: /var/preserve/web/htdocs/T0316)
1. regenerate cmfr: the last 7 templates are removed by modeller. remove the templates that cause the problem and regenerate. that is cmfr_new.pir, and the new models are generated in cmfr_new.
a cmfr_new model is generated. the second domain is not very good, but there is an anti-parallel beta-sheet. this model can be used.
2.
make a human.pir and generate stx. model 1 and model 3 must be replaced domain prediction need to be revised. Human intervention (3dpro: /var/preserve/web/htdocs/T0316) 1. regenerate cmfr: ok. the first domain is ok. but the second domain is very bad. 2. make a new alignment by combining cm and the second domain generated by FOLDpro(public) use simple_gap_comb.pl to combine them. human.pir, models are in ./human/. done. looks better. domain prediction: 1-245 (or 240), 246-end. compare the current model 1 of foldpro and 3dpro (only domain 1 matters), rmsd=3.2, z=5.9, ind=56%, aligned/gap = 182/59. They are significantly matched. decision: use its own human as the model 1 of foldpro and 3dpro respectively use cmfr_new of foldpro as the model 2 of foldpro. use human of foldpro as model 3 of 3dpro, use human of 3dpro as the model 3 of foldpro. resubmit domain prediction for foldpro. ----------------------------------------------------- T0315: easy target cdd template: 1J6O, tim-barrel, single domain. FOLDpro: model 5 need to be replaced by model 6 (to do) 3Dpro: ok. T0314: hard target sspro: alpha-beta(a few)-alpha protein. 3dpro: prediction is ok. foldpro: also put 1ksh on the top. foldpro: model 6,7 seems to be better than some of model 2-5. maybe use model 6 and 7 to replace model 4 and 5. and it is a single-domain protein. verify3d for foldpro: model 1 = 6.78 model 2 = 17.43 model 3 = 11.59 model 4 = 8.16 model 5 = 11.15 model 6 = 11.16 model 7 = 27.9 decision: use model 7 to replace model 4 ----------------------------------------------------------------------------- T0313: easy target cm model is very good. frcom is ok. fr1-3 are very bad (model 3-4). *************************PROBLEMS*********************** Why mostly the templates ranked on the top are not good when there are many significant match? Why the best templates are usually (no. fr 3, 4, 5)? in this case it is models 6,7,8? the top ranked templates are short? so it only cover a short stretch. 
So we really need to have a model selection method, or we really need to consider alignment length, or we need to rerank significant matches using psi-blast? this is an issue we need to address in the future. model 2 (frcom) has a lot of clashes. This is an important case for researching this problem.
**********************************************************
decision: foldpro: model 6-8 to replace model 3-5. resubmit model 1 due to error. all top 5 templates are single domains, so domain is ok.
3dpro: model 1 needs to be resubmitted due to error. also model 3 and 5 are bad. use model 7 to replace model 5 (no substitute for model 3 yet).
--------------------------------------------------------
T0312: hard target
mainly beta (sspro). both 3dpro and foldpro find the same template. 3dpro classifies it as positive, foldpro classifies it as negative. according to the appearance, the 3dpro model looks better. maybe it generates better alignments??? need to compare the models 1 of both and decide which one is put on the top.
verify 3d score:
foldpro: model 1 = 27.6 model 2 = 1.75 model 3 = 1.75 model 4 = 8.3 model 5 = 38
3dpro: model 1 = 25 model 2 = 15 model 3 = 43 (ab) model 4 = 43 (ab) model 5 = 10 model 6 = 30
compare model 1 of foldpro and model 1 of 3dpro: rmsd=1.8, z=6.5, ind=88.6, aligned/gap = 132/6
decision: foldpro: use model 1 of 3dpro to replace model 3 of foldpro (to do). done. 3dpro: use model 6 to replace model 5 (to do). done.
---------------------------------------------------------------------------------------------
AN IMPORTANT PAPER
An urgent task and a new idea: It looks like it is very important to have a post-modeling quality evaluation. The quality of a model is not determined by the templates alone, but by the final generated model, whose quality is affected by alignments, template quality (3D), and so on. For instance, some top ranked models are totally loose by visual inspection. So it is very important to develop a model evaluation tool, especially for hard targets.
So now I need to do this:
Features:
1. ss match scores (%)
2. sa match scores (%)
3. 8A average contact probability (>=5 separation)
4. 12A average contact probability (>=5 separation)
5. average contact order (measure complexity)
6. average contact number (measure compactness)
7. verify3d score
Other factors to consider:
a) svm score (to do). not used now because it is specific to FOLDpro.
b) other energy measures (e.g. ab-initio, Arlo's program)
c) other tools (procheck, prosaII)
d) may include other pairwise energy terms: Lennard-Jones, Skolnick, and so on.
Use these features to regress on the gdt-ts scores divided by 100. Train on hard targets, easy targets, and the combination to see how well it can improve performance. In this casp period, we only test it on hard targets. So I need to build a dataset from casp 6 hard targets. Hard targets are the targets with SVM score less than a threshold (e.g. 0). If this system is ready, we can use this system + visual inspection + clash checking to make the final decision. I won't use the clash feature in the quality evaluation. it is just used to exclude models with a lot of clashes. And we only train our systems on the single domains of CASP6. A more practical way is to train the systems on the whole target. Finally we can release this software and server to users.
data generation: we should use as many different models as possible. Ideally, we should use models in CASP6/7. Now I just use models generated by FOLDPRO.
We need to report the following results:
1) square error
2) how this evaluation improves the ranking of each different method (here, we evaluate on FOLDPRO models)
-------------------------------------------------------------------------------------------------------
T0311: easy target
3Dpro: region 10-65 has a match. two ends are not matched. FOLDpro: no cm match is found. This example shows that a large NR doesn't mean higher psi-blast sensitivity. both foldpro and 3dpro (fr) find a lot of significant matches.
model 3 (frcom) of 3dpro is pretty good. It should be used as the first model of 3dpro and also the first model of foldpro. the frcom (model 1) of foldpro has a long (30 residue) unaligned region.
compare model 1 of 3dpro (cmfr) with model 1 (frcom) of foldpro: aligned regions: 15-71, rmsd=3.5, z=4.1, ind=68.6, aligned/gap=70/15 (the middle region of cm is aligned)
compare model 1 of foldpro and model 3 (frcom) of 3dpro: rmsd=1.2, z=5.5, ind=100, aligned/gap=71/0 (from region 0 to 64)
compare model 1 and model 2 of foldpro: rmsd=2.7, z=5.0, ind=85.9, aligned/gap=71.
decision to do: done. foldpro: resubmit model 1-5 due to errors. use model 7 to replace model 3. 3dpro: use model 3 to replace model 1, use model 6 to replace model 3.
this is probably a case where frcom is better than cm because the cm alignment is too short??? This target is also an interesting orthogonal helix bundle (not a simple helix bundle like T0283)
----------------------------------------------------------------------------------------------
T0310: easy target
cdd template: 1BXSA. SCOP: 1 domain. CATH: 2 domains.
visual inspection: 1-269/473-500: domain 1; 270-472: domain 2. both domains are aba type. beta-sheet is parallel.
domains from cm models: domain 1: 1-214, 361-end; domain 2: 215-369
FOLDpro basically made the right prediction of domains. However, the last dangling regions are assigned to domain 1 instead of domain 2. This is a bug. Eventually it should be corrected in the code. Now need to manually correct it (to do). current cutting is: 1 - 2 - 1 - 2 (the last 2 should be changed to 1 or set to -). decision: set to 1. to do. Maybe the whole protruding region should be set to a third domain (but it will be too complex). we still stick to two domains. Later need to check with casp final results.
compare model 1 of 3dpro (cmfr) and model 1 (cm) of foldpro using CE: rmsd=4.1, z=6.9, ind=77.4, aligned/gap=376/48 --------------------------------------------------------------------------------------------- T0309: hard target 3Dpro: model 5 is very bad. should be replace by model 7 (or 6). verify3d score: model 1 = 1.1 model 2 = 24 model 3 = 21 model 4 = 7 model 5 = 1.1 model 6 = 8 model 7 = 4.25 decision: exchange model 1 and model 2, use model 6 to replace model 5. done. FOLDpro: model 4 and 5 are very bad. should be replaced by model 6 and 7. model 2 need to resubmit due to error. verify3d scores: model 1 = 5.34 model 2 = 7.26 model 3 = 12.13 model 4 = 1.30 model 5 = 1.59 model 6 = 4.25 model 7 = 6.85 done. -------------------------------------------------------------------------------------- T0308: easy target CDD templates: 1F6B (aba alpha-beta protein) 1MOZ 1EOS ------------------------------------------------------------------------------- T0307: hard target. 3Dpro: model 1 = 21 model 2 = 17 model 3 = 29 model 5 = 17. model 4 score is 4, model 6 score is 37 (highest, higher than model 1) model 7 score is 47. decision: to do: use model 6 and model 7 to replace model 4 and model 5. models 6 and 7 are ab-initio models whose scores correlate with Verify3d pretty well. ****************************************************************************************************** Looks like we need to develop an algorithm to rank models when no positive templates are found according to verify 3d score. Let's say we build 100 models (from 100 templates) using foldpro then rank them using verify3d. next CASP, I try to simulate 200 models from 200 templates if we have enough computing power, then use verify3d or ab-initio to rank these models and select top 5 for hard targets. This is a very simple approach. Hopefully we can identify all FR/A templates. 
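A minimal sketch of the "build many models, rank them by verify3d, keep the top 5" idea, assuming the scores are already collected in a dict and that a higher score is better. The function name rank_models_by_score is hypothetical; the example scores are the T0307 3Dpro values listed above. Ties keep their original order, so this is only a starting point, not a replacement for visual inspection.

```python
def rank_models_by_score(scores, n_top=5):
    """Return the names of the n_top highest-scoring models.

    scores: dict mapping a model name to its Verify3D score
    (assumption: a higher score means a better model).
    """
    # sorted() is stable, so models with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n_top]]


# Verify3D scores of the 3Dpro models for T0307, taken from the notes above.
t0307_3dpro = {
    "model 1": 21, "model 2": 17, "model 3": 29, "model 4": 4,
    "model 5": 17, "model 6": 37, "model 7": 47,
}

top5 = rank_models_by_score(t0307_3dpro)
print(top5)
```

With these numbers the two ab-initio models (6 and 7) move to the top and model 4 drops out, which agrees with the manual decision for T0307.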
********************************************************************************************************
COMMENTS: verify3d may not distinguish good models from less good models, but it can identify very bad models. We need to avoid putting very bad models (score < 10) on the top. If another model has a substantially higher score, it should replace the very bad models.
-----------------------------------------------------------------------------------------
T0306: very hard target
CDD no. AB-INITIO is not used because it was not predicted yet.
3Dpro: model 1 is trash. model 2 is better. model 5 is very bad. model 6 and 7 are better. Decision: use model 6 to replace model 1 and model 7 to replace model 5. done. Model 1 and 2 are rejected due to errors, so model 2 needs to be resubmitted as well.
Use verify3d to evaluate models 1 to 7 of 3Dpro: model 1 score = 1.1 model 2 = 1.62 model 3 = 7.5 model 4 = 1.75 model 5 = 1.1 model 6 = 14.28 model 7 = 10.02
the model 2 error is due to the template being too new. resubmit model 1 and 5 now. Model 2 can't be submitted (too new template). replace model 2 with the model 1 of 3Dpro. done. use one abinitio model to replace model 4. done. (ab-initio score is 26)
There is one local complexity region in template 1NAYA. Should we remove it??????????????????????? Hold on here. if it often appears on the top, it should be removed.
FOLDpro: model 1 score = 5 model 2 = 3 model 3 = 8 model 4 = 10 model 5 = 17
actions: FOLDPRO: exchange model 5 and model 1 because model 5 is much better. use model 5 of FOLDPRO to replace model 2 of 3Dpro. done.
----------------------------------------------------------------------------------------------------
T0305: easy target
The prediction of cm and fr is not very consistent. cm: largely an a+b domain. according to pdb, templates are consistent (largely ab domain). fr: largely aba domain. according to PDB, it is a single domain protein.
FOLDPRO finds a lot of positive templates, but the best templates are not ranked on top, and global alignment could cause problems as well. Stick to cm models. for FOLDpro, pdb overcut the domains: alpha+beta is cut into one alpha and one beta. We need to correct it.
--------------------------------------------------------------------------------------------------
T0304: hard target (3Dpro, ab-initio model is put on the first)
3Dpro: model evaluation using verify3d: model 1 = 51 (N/A model) model 2 = 35 model 3 = 46 model 4 = 15 model 5 = 5.8
FOLDpro: model 1 = 20 model 2 = 24 model 3 = 24 model 4 = 11 model 5 = 12
---------------------------------------------------------------------------------------------------
T0303: easy target
Problem: the topology of the cm model and the fr model is not consistent. cm model has two domains: alpha/beta + alpha helix. fr model has two domains: both are alpha/beta. Check cm models: all templates have consistent structure. CATH consistently classifies them into two domains: alpha/beta (aba) and mainly alpha. Also consistent with PDB classification most of the time. SCOP sometimes classifies some templates into one domain, sometimes into two domains. So we stick to the CATH classification. SSpro prediction matches very well with the secondary stx of template 2GFH. need to resubmit model due to chain error.
The reason for the frcom result is that the fold recognition is dominated by one domain. Many templates that match only one domain of the target are selected. the non-matched (aba) domains of the templates are used to model one domain of the target. This causes the problem. This problem is due to two reasons: a) lobster global alignment b) we use the full chain instead of individual domains to build the template library. This example shows that the cm model and local blast alignment are better than global alignment in this case.
------------------------------------------------------------------------------------------------
A new FR/A fold recognition approach: Protein fold recognition using pairing, connection, and relative position of secondary structure elements (especially designed for hard targets that can't be identified by sequence approaches)
1. beta-strand pairing
2. separation between secondary structures
3. super secondary structure patterns
4. secondary structure types, lengths, orders
This information can filter out most proteins. Then align secondary structures together and use energy to select final models.
------------------------------------------------------------------------------------------------
May 25
T0302: easy target
CDD templates: 1EMU, or 1AGR. probably it is an all alpha helix single-domain protein.
T0301: hard target
This protein has a lot of beta-sheets and some helix. good news. FOLDpro finds two significant matches:
1W61 (res = 2.1A): Proline Racemase, 350 residues, single domain.
1TM0A (res = 2.8A): putative proline racemase, 350 residues, one domain. 1TM0A is an alpha+beta protein.
It turns out that the two templates are very similar. Using CE alignment, rmsd=1.9A, Z=6.8, Seq ind=29.3%, aligned/gap = 317/42. in terms of size, T0301 is closer to 1W61.
Decision to make: should we use fr1 as model 1 or use frcom as model 1? frcom has too many clashes. we need to replace frcom using fr1.pdb.
3Dpro decision: use model 2 as model 1, use fr1.pdb as model 2. done.
decision 2: FOLDpro also has the same problem. frcom has a lot of clashes. so replace FOLDpro model 1 with the model 1 of 3Dpro.
Also Xu's domain parser prediction: Getting file pdb1w61.ent.Z from PDB... 2 domains have been found for 1w61a: Domain 1: 186-366 (conf:>73.1) Domain 2: 42-185;367-394 (conf:>45.8)
PDP also predicts two domains. So we need to predict two domains. resubmit the domain model as 2 domains.
This target also has another interesting implication: frcom only combine two templates whose structure are very similar according to CE. but the combined model is not consistent, probably due to alignment inconsistency. So even two structures are similar, if the alignment is not right, the combined models are still wrong. Usually the wrongly combined model has a lot of clashes. -------------------------------------------------------------------------------- May 24 T0300 hard target. An interesting, short protein. DISpro predicts a lot of disordered regions BETApro predicts only a number of pairs, but there are some long range between 35s <-> 95s. SSpro: three helix protein (helix is pretty long). there is one small strand at the end. For this protein, we may want to use ABIpro as well. Human prediction: helix-loop-helix-loop-helix structure. ---------------------------------------------------------------------------------- T0299: hard target. SSpro: an alpha/beta protein a lot of beta-residue contacts according to contact map predictor. very little disorder. ACCpro: beta-strands are buried. dompro predict it as a single domain protein. meta domain predict it as single domain. betapro: predict it has a mixed parallel and anti-parallel beta-sheet consisting of about 9-10 strands. So I predict it has a mixed beta-sheet core and is a single domain protein. Use beta-strand pairing, we can predict the stx of this protein. Human prediction: the protein is a 3-layer sandwich. the core is a mixed or parallel beta-sheet surrounded by alpha-helices on both sides. The model 1 of FOLDPRO doesn't look good. But I have no idea about how to handle this? should we use ab-initio, or should we change model order? and how??? --------------------------------------------------------------------------------- May 23, 2005 Boris said: it might be better to use a smaller, more representative library to remove false positives. 
I think it might be useful to retrieve a lot of redundant proteins from the same family. Think about this. Since we have two level modeling, a smaller FR library might be ok.
-------------------------------------------------------------------------------
T0298: easy target. (template 1GL3).
1GL3 is a two domain protein. domain 1 is a three-layer a/b sandwich, domain 2 is two-layer a/b. (1-135, 136-end?)
visually inspecting the cm model, it has two domains: domain 1 (1-130 & 320-end): alpha/beta/alpha sandwich; domain 2 (131-319): alpha + beta. (very similar to the PDP parsing. great!!!!!!!!!!!!!!!!!!!!!!!!)
comparing the stx generated by 3dpro and foldpro: rmsd=1.1, z=7.8, ind=99.4, aligned/gap = 329/4
Like 296, this target also has a lot of helices and strands. but the intervening helix (between strands) is not regular. the distance between strands is not regular. so it doesn't form a beta-barrel. (this may be a method to distinguish barrels from non-barrels.)
---------------------------------------------------------------------------------
T0297: easy target, single domain
-----------------------------------------------------------------------------------
SO FAR, T0283, 285, 287 ARE HARD TARGETS, SERVERS DON'T AGREE WITH EACH OTHER.
-------------------------------------------------------------------------------------
NEW IDEA: MULTIPLE TEMPLATE COMBINATION HAS PROBLEM OF HANDLING INCONSISTENCY IN TEMPLATES. ANOTHER IDEA TO IMPROVE MODEL GENERATION / SELECTION IS:
A) USE ALL SIGNIFICANT TEMPLATES TO GENERATE A MODEL
B) CLUSTER THE GENERATED MODELS AND SELECT THE BIGGEST CLUSTER
C) SELECT A MODEL CLOSEST TO THE CENTER AS THE FINAL MODEL???????????? OR USING WEIGHTED AVERAGE TO CREATE A MODEL?
----------------------------------------------------------------------------------------------
T0296: hard target. Large protein. find a match, but uncharacterized. also no stx found using CDD.
DOMpro: domain 1: 1-241, domain 2: 242-445
According to SSpro and ACCpro, the target has alternating helices and strands. And strands are usually buried. So this protein must be an alpha+beta protein, where the outside helices surround the core beta-sheets. the key issue is: is it a two-domain or one-domain protein?
Hard target. both 3Dpro and FOLDpro can't find positive templates. Use model 1 of 3Dpro to replace model 3 of FOLDpro.
DOMAIN PREDICTION NEEDS TO BE VERY CAREFUL: SO FAR META-DOM (INTERPRO, DOMSSEA, SSEP-DOMAIN) PREDICT ONE DOMAIN.
Human prediction: Analyzing the beta-strand pairings of T0296, this protein might have 12 or 13 strands which are paired in parallel. It might be a beta-barrel. So we need to change domain prediction to 1.
FINAL DECISION: for 3Dpro, use model 1 of FOLDpro to replace model 4 of 3Dpro.
If the protein is not a beta-barrel, it must have a very large beta-sheet inside that is packed by alpha helices. I need to develop a tool to generate tertiary structure according to secondary stx elements and beta-sheet topology (strand pairings), or even turns and helix orientation. One naive approach is to convert this topology into contacts, then use contact reconstruction. another approach is to directly set the topology and fit in the stx. I need to learn how to write a program to set the backbone. Check if there is some existing code or our own contact map reconstruction code. This is a kind of tertiary programming.
Final decision: I manually predict T0296 as a beta-barrel. So it is a single domain protein. but there is maybe a small dangling region (about 40 residues) at the C-terminus. But it may also be a disordered region. DISpro predicts the last 30 residues as disordered. BETApro_contact and SVMcon also predict the contacts between residues around 20 and residues around 420. So it may really be a large beta-barrel. But BETApro doesn't predict the pairing between strand 1 and strand (11/12/13).
That means for the prediction of a beta-barrel (the closing pair), the contact map predictor may help. --- consider developing a beta-barrel predictor using both strand pairing and contacts. This decision is difficult considering the size of the protein. The largest single domain protein in CASP6 is T0203, which has 365 residues.
Post analysis: this is really a hard target. the predictions from all groups are not consistent. SP3 seems to predict a good structure which consists of two similar domains. Each domain is alpha-beta-alpha. beta is a parallel beta sheet. If this is correct, then the protein should have two domains. the boundary is in the middle. All domain predictors except for Robetta and another group predict a single domain. DOMpro predicts two domains and the boundary is in the middle. so are DOMpro and SP3 right? Did I make a mistake in changing the DOMpro prediction for a large protein (400 residues?)
-----------------------------------------------------------------------------------------------
**********************************************************
DECISION: USE 3DPRO MODEL 1 TO REPLACE FOLDPRO MODEL 1 (because we only want to use one template for this case). before doing that, we need to copy models and recompare model 1 to make sure the copying is right. the Model 1 of FOLDpro is used as model 2 of foldpro and 3dpro. slightly adjust domain prediction: domain 1: 1-178, domain 2: 179-end
**********************************************************
T0295
CDD blast: Dimethyladenosine transferase (rRNA methylation) (no pdb hit according to CDD)
easy target, find a number of very significant matches. There is a strange issue here: FOLDpro uses two templates, 3Dpro uses one template. Due to the difference between e-values (24), both should use only one template. why does foldpro use two???? Reason is unknown. Be aware of this problem.
Now I compare the two models of 3dpro and foldpro using CE: we get RMSD: 1.2, z-score = 7.7, seq ind=100%, aligned/gap = 275/0.
So the two models are very very close, but not identical. it is a two domain protein: 1-179: domain 1, 180-275: domain 2. for this kind of clear domains, dompro predicts 1-161: domain 1, 182-275: domain 2.
----------------------------------------------------------------------------------------------
**********************************************************************
*********************WEEK TWO: TWO LESSONS****************************
******BE VERY CAREFUL ABOUT TWO DOMAIN COMBINATION******
TWO LESSONS SO FAR:
a) T0289: TWO DOMAINS, WE SHOULD USE FOLDPRO HUMAN FOR 3DPRO AS WELL
b) T0285: HARD TARGET. 3DPRO PREDICTION LOOKS BETTER AND VERIFY3D SCORE IS 26, TWICE AS MUCH AS T0289. I SHOULD HAVE USED IT FOR FOLDPRO AS WELL. AT LEAST, THAT MODEL SHOULD BE PUT INTO THE TOP 5 LIST OF FOLDPRO.
c) T0293: I found model 6 of 3Dpro T0293 is very good. it uses template 1xj5. this model should at least be included (sometimes maybe as model 1). it only uses one template. so we need to visually inspect more models and change the order if a model is visually better and the svm-score is significant and close. we should also remove apparently bad models from the top 5 and substitute them with model 6, 7, 8, and so on.
d) lessons: when both cm and cmfr exist, sometimes we want to add a good model based on a single template. cm can be removed because it should be very similar to cmfr.
e) T0293: should include model 6. should not use too many combination models. Also the first fragment (1-60) should be considered a domain. domain is based on the stx, not on the distance between two fragments.
f) when intervening in domain prediction, we must be very careful. according to the domain parser, T0293 should have two domains, but I wrongly classified them into 1 domain manually. the first fragment is long enough to be considered a domain. The key issue: when do we consider a fragment just a tail end, and when do we consider it an independent domain? look at the structure and also the length of the fragment. (length >=35-40?)
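A rough sketch of the length part of lesson f), assuming a single cutoff is enough for a first pass. The function name and the default of 40 residues are assumptions (the notes only suggest ">=35-40?"), and the length test does not replace looking at the structure itself.

```python
def classify_terminal_fragment(start, end, min_domain_len=40):
    """Label a terminal fragment as a possible domain or just a tail.

    start, end: 1-based inclusive residue numbers of the fragment.
    min_domain_len: hypothetical cutoff; the notes suggest 35-40 residues.
    """
    length = end - start + 1
    if length >= min_domain_len:
        return "domain candidate"
    return "tail"


# The first fragment of T0293 (1-60) is long enough to be a domain candidate.
print(classify_terminal_fragment(1, 60))
# A 30-residue end would be treated as a dangling tail.
print(classify_terminal_fragment(1, 30))
```

A fragment right at the cutoff still needs visual inspection before it is called a domain.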
LESSONS: WHEN WE FIND A GOOD STX FOR EITHER FOLDPRO OR 3DPRO, IT SHOULD BE USED FOR THE OTHER. DON'T BE AFRAID. AT LEAST, IT SHOULD BE PUT INTO THE TOP 5 LIST. 5/20/2006
***********************************************************************
***********************************************************************
----------------------------------------------------------------------------------------------
T0294: easy target: (we are similar to the best)
domain 1: 1-100: 1NAC (membrane ion channel-forming peptide), replaced by 1NRU????
domain 2: 101-328: 1DXY and 1GDH. 1DXY itself has two domains according to SCOP and CATH. 1GDH also has two domains. So the total number of domains is 3.
According to tertiary stx prediction, there are two domains containing non-continuous segments. 3Dpro: 1-104 and 296-328: domain 1, 105-295: domain 2.
post analysis: compare models against sp3 using CE
jigsaw: rmsd=3.0, z=7.6, ind=97, aligned=311
3dpro: rmsd=2.7, z=7.5, ind=91, aligned=307
foldpro: rmsd=3.2, z=7.4, ind=94, aligned=311
hhpred: rmsd=2.8, z=7.4, ind=94, aligned=312
metatasser: rmsd=3.4, z=6.0, ind=81, aligned=298
mGen3D: rmsd=2.2, z=7.5, ind=94, aligned=298
Raptor: rmsd=2.5, z=7.5, ind=94, aligned=307
zhang: z=7.6
Sam: z=3.3: totally a failure???????????????
-------------------------------------------------------------------
T0293: (we are worse than the best. alignment is too short.........................)
?????????????????????????????????????????????????????????????????????????????
QUESTION: WHY DO MOST GROUPS FIND 1T43 AS THE TEMPLATE? DOES EVERYBODY USE BLAST TO SEARCH FIRST (INSTEAD OF USING PSI-BLAST)?????????? LOOKS LIKE 1T43 IS THE BEST TEMPLATE. IT COVERS THE WHOLE SEQUENCE............
??????????????????????????????????????????????????????????????????????????????
Post analysis 5/22/06
Domain prediction seems to be ok. most groups predict it as one domain, except that Baker's group considers a small chunk a fusion domain.
so the two ends of dangling regions don't need to be considered a domain.
use CE to compare stx:
foldpro-sp3: rmsd=2.3, z=6.0, seq ind=55, aligned/gap=177/64.
jigsaw-sp3: rmsd=3.0, z=6.1, ind=68, aligned/gap=196/71
hhpred-sp3: rmsd=1.6, z=6.7, ind=83, aligned/gap=206/50
karipis-sp3: rmsd=1.8, z=6.9, ind=90, aligned/gap=214/37
nfold: rmsd=0.9, z=6.8, ind=87, aligned/gap=199/28
pcons-sp3: rmsd=2.3, z=6.7, ind=93, aligned=215
raptor-sp3: rmsd=2.6, z=7.2, ind=89, aligned=239
sam-sp3: rmsd=4.3, ind=67, aligned=103.
zhang-sp3: rmsd=2.7, z=6.7, ind=86, aligned/gap=223.
SO OURS IS ONLY CLOSER TO SPARKS3 THAN SAM-T06. We didn't generate the best alignment with the template. Our alignment is simply too short.
Post-thinking on 5/21/2006. It really looks like this protein actually has three fragments: one big fragment (51-210) and another two fragments, 1-50 and 211-250. The key issue: Are the two fragments at both ends domains? fragment 1 may be a zinc finger domain? 3Dpro model 6 is so good that it should be included. According to this one, fragment 1 should be a domain. the last fragment is just a short stretch that can't be considered a domain. comparing model 1 and model 1 (human), the core regions (61-200) are aligned well, but the first domain is not. We should include at least model 1 which uses the third template 1jx5.
Final judgement: the protein should have two domains, not just one. So HUMAN DOMAIN PREDICTION: 1-50, 51-250 (THE LAST END IS ALSO LINKERS)
According to this new domain definition, compare the 1-60 fragment of model 1 and model 6 using CE: rmsd=4.0, z=2.3, seq ind=3.1, aligned=32. so it is not well aligned. Basically, the similarity of domain 1 is very low.
LESSON: SHOULD NOT USE TOO MANY COMBINATIONS. SHOULD NOT LET THE BIG DOMAIN COMPLETELY DOMINATE THE SMALL DOMAIN. MORE IMPORTANTLY, IN THIS CASE, OUR AUTOMATIC METHOD INDEED CLASSIFIES THE PROTEIN INTO TWO DOMAINS (1-50, 51-END). domain 1 looks like a zinc finger. combination is hard????
is this due to the PSI-BLAST ALIGNMENT (TOO SHORT) OR DUE TO NEEDING TO TAKE FRAGMENTS FROM MULTIPLE TEMPLATES?
5/20/2006: I found model 6 of 3Dpro T0293 is very good. it uses template 1xj5. this model should at least be included (sometimes maybe as model 1). it only uses one template. so we need to visually inspect more models and change the order if a model is visually better and the svm-score is significant and close.
decision: FOLDpro: use human model 1 as model 1 and change domain to 1 domain.
decision: 3Dpro: use human model as model 1
3Dpro comparison:
human <-> model 1: z=16.3, rmsd=4.2, ind=91.2%, aligned/gap = 176
human <-> model 2: z=16.3, rmsd=4.9, ind=87, aligned = 193
human <-> model 3: z=8.4, rmsd=3.0, ind=53, aligned = 144
human <-> model 4: z=7.1, rmsd=3.0, ind=47, aligned = 133
human <-> model 5: z=8.9, rmsd=9.6, ind=39, aligned = 150
T0293: hard target in the sense of combination. not very hard. 3Dpro finds some templates, particularly for the central parts. the two ends are not very well matched.
visual inspection: the second domain is not well predicted. 1ORI can cover 54-235 (including the second domain), but its e-value is only e-18, so only small fragments are included. We need to use it to predict a more coherent domain. we need to generate a human prediction for this (at least for FOLDpro).
Human in FOLDpro: use cm_main_comb_join.pl cm_opt fasta file, output file to redo by setting the e-value threshold to -17, so 1ORI can be used. another issue is: 2B3T, which covers the front end of the target in 3Dpro, is aligned too short in FOLDpro to be used. Unfortunately, when we add a lot of alignments, Modeller fails to generate a stx. a lot of alignments must be removed. on FOLDpro, it finally generates a stx, but not as good as 3Dpro (cm.pdb). Currently, the final 40 residues of 3Dpro are not well wrapped. The first 33 residues are slightly better predicted.
*************TO DO******************
FINAL DECISION: take the first model (cm.pdb or cmfr.pdb) from 3Dpro and use it as the first model of FOLDpro.
*************TO DO******************
***********VERY VERY VERY IMPORTANT: Fortunately, 3Dpro generates an excellent stx with only the last 30 residues not well predicted. We must use this stx for both 3Dpro and FOLDpro. Also I can take 30 residue fragments from other proteins. Now I add one fragment from 1OR8A, which is also found by psi-blast, to generate a stx. but I can't add the whole 1OR8 because it will cause Modeller to crash. Compare this stx with the top 1 stx of 3Dpro and FOLDpro. If similar, this stx should be the stx submitted as model 1 for both 3Dpro and FOLDpro.
the alignment file is: T0293.human.pir, pdb file is: T0293.human.pdb in mine 3: /var/preserve/prosys/web/cgi-bin/work/114796762319404-3d/human/out/out
DOMAIN ASSIGNMENT: 1 domain.
************************************IMPORTANT PREDICTION*********************************************
For this protein (visually inspecting cm.pdb of 3Dpro), I predict it is a single domain protein. Residues 145 to 250 also form a beta-sheet (four strands) and two helices that should be joined with the residue 1-145 domain. Unfortunately, I can't adjust the stx manually to make them integrate together.
*******************************************************************************************************
--------------------------------------------------------------------
T0292: an easy target: Serine/Threonine protein kinases, catalytic domain. Phosphotransferases of the serine or threonine-specific kinase subfamily.
1JNK: scop classifies it into one domain, CATH classifies it into two domains. pdb classifies it into one domain.
DOMpro: predicts single domain
FOLDpro: classifies into two domains (1-86, 87-end).
According to visual inspection, it could be one or two domains. Domain prediction is ambiguous. So just stick to FOLDpro.
----------------------------------------------------------------------
T0291: easy target
CDD classification and function: Tyrosine kinase, catalytic domain. Phosphotransferases; tyrosine-specific kinase subfamily. Enzymes with TyrKc domains belong to an extensive family of proteins which share a conserved catalytic core common to both serine/threonine and tyrosine protein kinases. Enzymatic activity of tyrosine protein kinases is controlled by phosphorylation of specific tyrosine residues in the activation segment of the catalytic domain or a C-terminal tyrosine (tail) residue with reversible conformational changes.
1FGI: two domains (two protein-kinase-like folds according to SCOP); CATH also two domains.
1IR3: one protein kinase domain according to SCOP. CATH: two domains.
DOMpro predicts two domains.
------------------------------------------------------------------------
T0290: easy target
cyclophilin_ABH_like: Cyclophilin A, B and H-like cyclophilin-type peptidylprolyl cis-trans isomerase (PPIase) domain. This family represents the archetypal cytosolic cyclophilin similar to human cyclophilins A, B and H. PPIase is an enzyme which accelerates protein folding by catalyzing the cis-trans isomerization of the peptide bonds preceding proline residues. These enzymes have been implicated in protein folding processes which depend on catalytic/chaperone-like activities. As cyclophilins, Human hCyP-A, human cyclophilin-B (hCyP-19), S. cerevisiae Cpr1 and C. elegans Cyp-3, are inhibited by the immunosuppressive drug cyclosporin A (CsA). CsA binds to the PPIase active site. Cyp-3. S. cerevisiae Cpr1 interacts with the Rpd3 - Sin3 complex and in addition is a component of the Set3 complex. S. cerevisiae Cpr1 has also been shown to have a role in Zpr1p nuclear transport. Human cyclophilin H associates with the [U4/U6.U5] tri-snRNP particles of the spliceosome.
1M63: single domain according to CATH and SCOP.
DOMpro: single domain
-------------------------------------------------------------------------
Comparing to SP3, ROBETTA, RAPTOR: our second domain is not well predicted. The second domain looks like a beta-barrel. We find the correct template, 2BCO, the same as sp3, but apparently we didn't get the second domain well aligned using psi-blast. Apparently, FOLDpro (human) is better than 3Dpro. We should replace both, in terms of domain orientation, but since it is evaluated by domain, this should be ok.
sp3 <-> 3dpro: rmsd=2.1, z=6.5, seq ind=65%, aligned/gap=183/32 (only the first domain is aligned)
sp3 <-> foldpro: rmsd=2.3, z=6.8, ind=48.5, aligned/gap=266 (aligns both domains)
sp3 <-> Karypis: rmsd=3.3, z=4.9, ind=61.3, aligned/gap=150/51
sp3 <-> raptor: rmsd=2.0, z=7.0, ind=77, aligned=280
sp3 <-> bayeshh: rmsd=2.6, z=7.0, ind=71, aligned/gap=279
sp3 <-> hhsearch1: rmsd=3.0, z=6.5, ind=84, aligned=280

THIS TARGET INDICATES THAT WE HAVE CHALLENGES GETTING THE ALIGNMENT OF A TWO-DOMAIN PROTEIN RIGHT. WE MAY ALSO NEED TO BUILD A COMPLEMENTARY LIBRARY BASED ON SCOP SINGLE-DOMAIN PROTEINS, WHICH CAN MAKE THE ALIGNMENT EASIER. THIS IS ALSO A PROBLEM OF TEMPLATE COMBINATION: IF WE USE ONLY ONE RIGHT TEMPLATE BUT GET A FULL ALIGNMENT, THIS MAY BE EASIER. It also means PSI-BLAST can't generate very long alignments.
ANOTHER TRICK IS: USE THE TEMPLATES FOUND BY PSI-BLAST, BUT GENERATE ALIGNMENTS USING LOBSTER, THEN GENERATE STX FOR TWO-DOMAIN PROTEINS IF WE BELIEVE BOTH DOMAINS SHOULD BE USED BUT PSI-BLAST ONLY GENERATES A LOCAL ALIGNMENT FOR ONE DOMAIN.

T0289 decision: intervene in FOLDpro, leave 3Dpro as it is.
Only take the templates in frcom.pir that appear in cm.pir, regenerate the stx, and put the model as the first model, using a script:
~/modeller.sh pir_file output_dir
then convert pir to pdb:
prosys/pdb2casp.pl pdb_file pir_file model_index output_file
then compare the new pdb with casp1 to casp5 to make the final decision. To do.
currently for foldpro:
289_3 (frcom) is good.
289_1: not good, domain 1 is too small.
289_2 (cm): only one domain.
strategy:
1. generate a human from frcom, use it as model 1.
2. move current model 1 to model 2.
T0280.human is created. Compare it to model 1 of 3dpro:
DALI: z-score: 19.4, aligned res: 207, rmsd: 21, seq ind=53%.
CE: rmsd: 2.1, z-score: 6.6, sequence ind: 62.8%, aligned = 191.
So it is similar enough. decide: submit the new human as model 1.
****************************************************************
domain decision: 1-210, 211-313. resubmit as well
****************************************************************
3Dpro: frcom/cmfr is a two-domain prediction.
Another approach is to regroup the templates in frcom, remove inconsistent templates, and regenerate the stx.
group I:
Looks like 1YW4 is a very good template, but the domain definition is very ambiguous for this case. According to PDB, this protein is a two-domain protein.
1UWY is also two domains (1-296, 297-403).
1H8L: two domains (domain classification is the same as 1YW4 in GO).
2G9D: one domain (but looks like two domains, as 1YW4).
2BCO: Succinylglutamate desuccinylase (2BCO:A, B) * hydrolase activity, acting on ester bonds * metabolism
1YW6.
template 1UWY: two domains
group II:
template 1O5W: two domains
template 1QYD: fold is different from 1UWY.
The folds of 1UWY and 1O5W are different. (The next combination should check scop or stx clustering, or construct a phylogeny tree.)
3Dpro and FOLDpro find a lot of significant templates. Now the key issue is alignment. Lobster generates a profile-profile full-length alignment that covers all regions of T0289 for 1UWY, 1O5W, 1H8L, 1YW4, 1QYD, 2G9D. Let's see how cm and fr are combined. If a lot of small fragments are added, it won't help. A more clever combination is to select the biggest fragment from a number of templates, so the order of combination considers both the ranking and the fragment contribution. (TO DO FUTURE).
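A minimal sketch of the "select the biggest fragment" idea, in Python. Everything in it is an assumption for illustration: the template names, ranks, covered ranges, and the min_gain cutoff are hypothetical, and this is not how cm and fr are currently combined. It greedily picks the template that adds the most not-yet-covered target residues, breaking ties by ranking, and stops when the best remaining template would only add a small fragment.

```python
def select_templates(templates, min_gain=10):
    """Greedy template selection by newly covered residues.

    templates: list of dicts with hypothetical keys
        name  - template label
        rank  - 1 is the best-ranked template
        start - first target residue covered by the alignment (inclusive)
        end   - last target residue covered by the alignment (inclusive)
    min_gain: assumed cutoff; a template adding fewer new residues
        than this is treated as a small fragment and skipped.
    Returns a list of (name, newly_covered_residues) in selection order.
    """
    covered = set()
    chosen = []
    remaining = list(templates)

    def gain(t):
        # Number of target residues this template would newly cover.
        return len(set(range(t["start"], t["end"] + 1)) - covered)

    while remaining:
        # Largest contribution first; better (smaller) rank breaks ties.
        best = max(remaining, key=lambda t: (gain(t), -t["rank"]))
        g = gain(best)
        if g < min_gain:
            break
        chosen.append((best["name"], g))
        covered.update(range(best["start"], best["end"] + 1))
        remaining.remove(best)
    return chosen


# Hypothetical example for a 313-residue target.
example = [
    {"name": "tmplA", "rank": 1, "start": 1, "end": 171},
    {"name": "tmplB", "rank": 2, "start": 1, "end": 102},
    {"name": "tmplC", "rank": 3, "start": 150, "end": 313},
    {"name": "tmplD", "rank": 4, "start": 160, "end": 170},
]
print(select_templates(example))
```

In this made-up example only tmplA and tmplC are kept, because tmplB and tmplD add no residues that are not already covered. A real version would probably also need the structural consistency check between templates noted above before combining them.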
---------------------------------------------------------------------------------------------
T0289: (easy) Succinylglutamate desuccinylase / Aspartoacylase family (single domain????)
This is an interesting target.
3Dpro: profile includes 43 sequences.
FOLDpro: profile includes 113 sequences.
But psi-blast in 3Dpro generates longer local alignments covering more regions. Both find the same templates, but the 3Dpro alignment covers the first 171 residues, while FOLDpro only covers the first 102 residues. That means a larger NR database does not necessarily yield longer, more significant alignments. Let's wait to see what FR finds and how they are combined together.
Templates:
2G9D: Succinylglutamate desuccinylase. function class: hydrolase.
1YW4: Succinylglutamate desuccinylase. function class: hydrolase.
1YW6: same as above.
2BCO: same as above.
All are from the same family and are consistent. 2G9D has a lot of disordered regions.
-------------------------------------------------------------------------
T0288 is an easy target. A lot of matches from psi-blast.
--------------------------------------------------------------------------
TRICKS: MANUALLY EVALUATE MODELS FOR HARD TARGETS USING SA, SS, VERIFY3D, SVM RANK SCORE, and VISUAL INSPECTION.
------------------------------------------------------------------------
T0287: hard target
The majority of secondary structure elements are helices. It has three strands which are predicted to form an anti-parallel beta-sheet. For this kind of hard target, we can at least get this beta-sheet right, which is worth a lot of GDT-TS score. Then we try to put a few helices in the right position.
---------------------------------------------------------------------------
T0286 is an easy target, identified by CDD.
homologous to cd00229.3 (representatives: 1ESC, 1ESE)
SGNH_hydrolase, or GDSL_hydrolase, is a diverse family of lipases and esterases.
The tertiary fold of the enzyme is substantially different from that of the alpha/beta hydrolase family and unique among all known hydrolases; its active site closely resembles the typical Ser-His-Asp(Glu) triad from other serine hydrolases, but may lack the carboxylic acid.
from CATH and SCOP (about 1ESC):
SCOP Classification (version 1.69)
Domain Info: d1esc__
Class: Alpha and beta proteins (a/b)
Fold: Flavodoxin-like
Superfamily: SGNH hydrolase
Family: Esterase
Domain: Esterase
Species: Streptomyces scabies
CATH Classification (version v2.6.0)
Domain: 1esc00
Class: Alpha Beta
Architecture: 3-Layer(aba) Sandwich
Topology: Rossmann fold
Homology: HYDROLASE
Both FOLDpro and 3Dpro found related proteins in the Flavodoxin fold.
compare models:
3dpro: model 1 and 3? aligned: 176, rmsd: 2.1, seq ind: 81%, z = 22.
foldpro: aligned residues: 181, rmsd: 2.9, seq ind: 77%, z = 18.
So those models are very similar.
---------------------------------------------------------------------------
DECIDE NOT TO INTERVENE ANYMORE UNLESS WE FIND SOME VERY OBVIOUS PROBLEM.

T0285: (hard target)
BETApro and SVMcon find a lot of common contacts. 285 is a hard target. According to the secondary stx, the two ends are two long helices; in the middle are short beta-strands (and a little helix).
3Dpro predicts an alpha+beta protein (cool stx).
FOLDpro: the model is pretty loose. Model 2 looks better.
Let's try to use verify3d to evaluate the models.
3Dpro: verify3d ranking is consistent with svm ranking. model 1: 25.
FOLDpro: model 1: 12, model 2: 15, model 3: 16, model 4: 16, model 5: 17.
The svm scores of model 1 and model 2 are very close, and model 2 is also ranked third by 3Dpro. So decide to exchange model 1 and model 2 of foldpro. Model 2 also seems to fit the secondary stx better. DONE!
compare model 1 of foldpro with 3dpro: there is a little similarity.
compare model 2 (used as new model 1) of foldpro with model 1 of 3dpro: there is no similarity at all.
That means visual similarity is not reliable at all.
FINAL DECISION: use the original model 1 and model 2. Don't exchange.
I guess it is a new fold.
Human prediction of topology:
BETApro prediction of four strands:
3--4:A:[89-92:101-98]:1.78
2--3:P:[62-65:89-92]:0.82
1--4:A:[39-45:104-98]:0.78
Key elements of the protein: H1 E1 E2 E3 E4 H2
I guess the topology is: strand pairings as above. H1 and H2 are also in parallel. So it is a two-layer protein: one layer is four strands, the other layer is two helices.

Formulate an ab-initio protein stx prediction algorithm:
Step 1: generate building blocks/fragments according to the secondary stx and associate a flexibility score with each block (bending flexibility, switching flexibility that the block is changed to another SS element, buried/exposed score). Generate local stx for these blocks. Also predict the turning elements.
Step 2: align beta building blocks (may select a number of patterns according to BETApro). This will generate a number of different trajectories.
Step 3: align helical elements according to the beta-sheet and turns.
Step 4: MCMC refinement. Adjust positions of elements or Ca atoms and select by energy function.
Step 5: cluster stx using stx alignments.
-----------------------------------------------------------------------------
T0284 easy
---------------------------------------------------------------------------
T0283 (hard)
3D: (models are ok) (four helix bundle)
FOLDpro and 3Dpro: 1P68A (same as sparks3)
FOLDpro also identifies 1NI7, same as raptor, forte, and meta-tasser.
visual: sparks3 predicts a four-helix bundle. robetta-ab: three-helix bundle.
align against Robetta using Dali:
foldpro vs robetta: Z-score = 1.4, RMSD = 8.2, %= 44.
sparks vs robetta: z=2.0, RMSD: 3.5, %=8.
foldpro vs sparks: z=3.5, rmsd=2.8, %=5.
abipro vs robetta: z-score = 2.6, aligned: 68, rmsd: 7.3, %= 6.
domain: all except ginzu: 1 domain
Contact:
betapro: ok
distill: all short range contacts.
GPCPRED: >=8 separation
Pssum (Hamilton): long range
PROFcon: >= 6
SAM: long range
SVMcon: ok
------------------------------------------------------------------------------
T0284: (easy)
FOLDpro: model 3 (frcom), 3Dpro: model 2 (frcom): Ca-Ca clashes (about 15 pairs) due to stx inconsistency among a lot of templates.
Solution:
a) we should have a Ca-Ca clash detection script that takes a model (Ca-Ca < 4.? Angstrom)
b) Ca-Ca clashes can be used to discard some models or to discard some templates in model generation
c) we should use stx alignment to check consistency among top-ranked templates, and choose the largest consistent cluster of templates to generate stx in future.
Next time:
a) visually inspect; if there are clashes, we may discard the model and replace it with model 6.
b) or detect Ca-Ca clashes, and regenerate the model by selecting half of the templates? or the top five templates?
To do: write a script to check clashes. Send an alert email to me if clashes happen.
Clashes may give some advantage, but will be seriously penalized.
From this paper: CASP6 data processing and automatic evaluation at the protein structure prediction center. Andriy Kryshtafovych, Maciej Milostan, Lukasz Szajkowski, Pawel Daniluk, Krzysztof Fidelis
* definition of geometric irregularities:
Ca-Ca distance irregularity: 0.1 < dist < 3.6 or dist > 4.0
Severe collision: 0.1 < dist < 1.9
Same position: dist < 0.1
They also check model similarity between predictions and identify similar or identical models. done.
Another reason that causes clashes is that the template is not good: T0287, FOLDpro, model 1.
Mistake: for models with more than 5 (>5) clashes, or 1 severe clash, the model should be discarded. Use model 6 or 7 to replace it. Generate a clash report.
CASP6 penalization policy for clashes: Those models with greater than 50 bumps (where the Ca-Ca distances were between 1.9 Å and 3.6 Å) or that had more than 4 severe clashes (Ca-Ca distances of less than 1.9 Å) were penalized.
The choice of cut-offs was rather arbitrary, but also fairly generous. We checked a selection of 1000 chains from the PDB and found just one chain with more than 16 minor clashes. Penalized models were inspected manually and those that contained visible backbone-backbone clashes or that were otherwise clearly unfeasible [Fig. 1(a,b)] had both their AL0 and GDT-TS z-scores set to 0. In total, 55 first models were penalized in this way.
Reference: Assessment of predictions submitted for the CASP6 comparative modeling category. Michael Tress, Iakes Ezkurdia, Osvaldo Graña, Gonzalo López, Alfonso Valencia
Comments: Many servers select 1MUM. FOLDpro selects a lot of templates, but 1MUM is not in the top five shown in the model file (ranked no. 6). But it is definitely used in modeling. 1MUM resolution is 1.9 A; 1S2V is 2.1 A. They are in the same family. In the first round of pdb-blast, 1MUM is ranked #2; in the second round it was ranked #6. I think most other people probably only use blast to search PDB to get 1MUM.
Stx-stx alignment between SP3 for T0284 using CE:
Rmsd = 1.5Å Z-Score = 7.5 Sequence identity = 97.0% Aligned/gap positions = 265/14
between sp3 and Karypis (both are using the same template):
Rmsd = 1.5Å Z-Score = 7.7 Sequence identity = 98.9% Aligned/gap positions = 273/5
sparks 3 and jigsaw (using Dali); jigsaw also uses multiple templates:
z=34, aligned residues: 261, rmsd: 1.7, seq identity: 95.
sparks 3 and foldpro (using Dali):
z=37.5, aligned residues: 275, RMSD: 2.7, seq identity: 96.
The stxs are very similar.
Domain: FOLDpro, 1 domain, same as most others (such as meta-dp)
VERY IMPORTANT: IN THE BETApro prediction for this target, the strands are almost completely correctly predicted except for the pairing of the first and the last strands. This means that our BETApro can be used to reconstruct this kind of protein structure.
------------------------------------------------------------------------
T0287: hard target
Fugue finds 1V0D, same as 3Dpro (rank #1).
There is no other consensus.
Domain: 1 domain, same as meta-dp.
RR prediction: SVMcon usually has a few contacts in common with SAM-T06.
Foldpro <-> Robetta: no similarity
Foldpro <-> sp3: z=1.0, aligned res: 65, rmsd: 3.7, seq ind: 3%.
sp3 <-> Robetta: no similarity
3dpro <-> robetta: z=0.2, aligned res: 63, rmsd: 9.2, seq id: 5.
abipro <-> robetta: z=0.1, aligned res: 57, rmsd: 10.3, seq ind=4%.
All stxs are very different.
-----------------------------------------------------------------------------------
T0283: hard target
predict as four helix bundle
------------------------------------------------------------------------------------
#####################################################################################
Very important adjustment:
1. show up to 8 parents: change the pdb2casp.pl file.
2. if there is no chain id, only the four-letter pdb code needs to be shown. Don't need to add "_". FOUR LETTER PDB CODE IS GOOD ENOUGH.
######################################################################################
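
Sketch for the clash-check to-do from the T0284 notes, in Python. This is only a rough version under stated assumptions: the function names are made up here, Ca atoms are read from fixed-column PDB ATOM records, and every pair of Ca atoms is compared (not only neighbours). The 1.9 and 3.6 Angstrom cut-offs are the CASP6 values quoted in the notes, and the discard rule is the one written there (more than 5 clashes, or 1 severe clash).

```python
import math


def ca_coords(pdb_text):
    """Return (x, y, z) for every Ca atom in PDB-format text.

    Assumes standard fixed-column ATOM records: atom name in
    columns 13-16, coordinates in columns 31-54.
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def count_clashes(coords, bump=3.6, severe=1.9):
    """Count Ca-Ca bumps and severe clashes over all atom pairs.

    bump / severe are the CASP6 cut-offs quoted in the notes.
    Pairs closer than `severe` (including overlapping atoms at
    dist < 0.1) count as severe; pairs between `severe` and
    `bump` count as bumps. Consecutive Ca atoms at the usual
    ~3.8 Angstrom spacing fall outside both ranges.
    """
    bumps = 0
    severes = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d < severe:
                severes += 1
            elif d < bump:
                bumps += 1
    return bumps, severes


def should_discard(bumps, severes, max_bumps=5):
    """Discard rule from the notes: more than 5 clashes or any severe clash."""
    return bumps > max_bumps or severes >= 1


# Tiny made-up example: three Ca atoms on a line plus one too close to the first.
toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (0.0, 3.0, 0.0)]
toy_bumps, toy_severes = count_clashes(toy)
print(toy_bumps, toy_severes, should_discard(toy_bumps, toy_severes))
```

The all-pairs loop is quadratic in the number of residues, which should be fine for single models. The clash report and the alert email would be built on top of count_clashes; neither is written here.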