T0386: some easy. two ends (30 residues each) are hard

final decision on domain:
	predict as two domains:
	a) domain parser predicts three fragments
	b) fragment 1 and fragment 2 are very close, so they should be in the same domain
	d) domain fusion mechanism?
	e) alignment signal?
	f) secondary stx pattern: at the last end, there is a lot of beta-sheet.

	use human 5, domain parser cuts it into three domains.

	if we check the two fragments at both ends, they are very close.
	in this case, two fragments are joined into one domain.
	(domain fusion mechanism)
	Also, we can check alignment boundary.

	so we can cut it into two domains:

	1-33 and 263-end: domain 1.
	34-262: domain 2

	predict: 1-33 and 263-end form an a-b-a domain.
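
A minimal sketch of the fragment-joining step above, assuming the fragments are given as residue ranges and that the pairs judged spatially close come from inspecting a model by hand. The function name, the input format and the end residue 300 are hypothetical (the notes only say "263-end").

```python
def fuse_fragments(fragments, close_pairs):
    """Group fragments into domains; fragments marked as close share a domain.

    fragments:   list of (start, end) residue ranges, 1-based inclusive.
    close_pairs: list of (i, j) fragment indices judged spatially close
                 (hypothetical input, e.g. from visual inspection).
    """
    parent = list(range(len(fragments)))

    def find(i):
        # Follow parent links to the representative of the group.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in close_pairs:
        parent[find(j)] = find(i)

    domains = {}
    for idx, frag in enumerate(fragments):
        domains.setdefault(find(idx), []).append(frag)
    # Order domains by their first residue.
    return sorted(domains.values(), key=lambda segs: segs[0][0])


# Three fragments from domain parser; the two end fragments are close.
domains = fuse_fragments([(1, 33), (34, 262), (263, 300)], [(0, 2)])
```

Here the two terminal fragments end up in one discontinuous domain and the middle fragment forms the second domain.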

according to CDD blast, it seems to have two domains too: one is a filament domain, the other is
unknown (the filament domain has a HPFXGNG motif which is aligned well in both the psi-blast and lobster alignments).
it looks like lobster alignment is better in this case?	
CDD provides evidence that this protein is probably two-domain.


foldpro:
	use human3 as model 1
	use human 5 as model 3

	final:
	use human 0 as model 1.

3dpro:
	keep model 3
	use human 3 as model 1
	use human 2, 4, 5 as model 2, 4, 5.

2F6S:  filamentation protein
2g03: filamentation protein

cm_ab seems to be the best, but it has a knot.
now use 2f6s and 2g03 to combine with ab to generate models respectively.


T0385: hard


decision:
	use human models 1-5 as models 1-5 for foldpro and 3dpro respectively.


select five models for foldpro and 3dpro each from top 150 models
compare foldpro

1-2: 6.6
1-3: 5.7
1-4: 5.7
1-5: 5.5
2-3: 6.1
2-4: 5.6
2-5: 5.5
3-4: 5.3
3-5: 5.3
4-5: 5.5

compare 3dpro: (all belong to ferritin family according to scop)
1-2: 5.7
1-3: 6.3
1-4: 5.6
1-5: 6.5
2-3: 5.7
2-4: 6.3
2-5: 5.7
3-4: 5.5
3-5: 6.6
4-5: 5.7

compare 3dpro model 1 and foldpro model 1:
Rmsd = 1.2Å Z-Score = 6.7
Sequence identity = 98.0%
Aligned/gap positions = 148/2
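
The three-line comparison blocks used throughout these notes can be collected automatically. A small sketch, assuming the text always follows the layout shown above (Rmsd / Z-Score, Sequence identity, Aligned/gap positions); the function name and the dictionary keys are made up for illustration.

```python
import re

# Matches one comparison block in the layout used in these notes.
CE_PATTERN = re.compile(
    r"Rmsd\s*=\s*(?P<rmsd>[\d.]+)\s*Å\s*Z-Score\s*=\s*(?P<z>[\d.]+)\s*"
    r"Sequence identity\s*=\s*(?P<ident>[\d.]+)%\s*"
    r"Aligned/gap positions\s*=\s*(?P<aligned>\d+)/(?P<gaps>\d+)"
)


def parse_ce_blocks(text):
    """Return one dict per comparison block found in text."""
    results = []
    for m in CE_PATTERN.finditer(text):
        results.append({
            "rmsd": float(m.group("rmsd")),
            "z_score": float(m.group("z")),
            "identity": float(m.group("ident")),
            "aligned": int(m.group("aligned")),
            "gaps": int(m.group("gaps")),
        })
    return results


block = (
    "Rmsd = 1.2Å Z-Score = 6.7\n"
    "Sequence identity = 98.0%\n"
    "Aligned/gap positions = 148/2\n"
)
parsed = parse_ce_blocks(block)
```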


generate stx for top 150 templates

foldpro:
	model 5, 7 better
3dpro:
	model 6,7 (ab initio) are better.


---------------------------------------------------------------------------------------------------------
observation:
foldpro alignment often fails for multi-domain protein even though the correct template is identified.
the reason is: one correct domain, one wrong domain.
---------------------------------------------------------------------------------------------------------

T0384: easy
foldpro: 
	keep model 1, use easy1-4 as model 2-5
3dpro:
	keep model 1, use easy2-5 as model 2-5


T0383: hard

foldpro
	only model 6 is good.

	use human 1-4 as model 1-4
	model 6 as model 5

	human models are generated from top 150 templates and evaluated by model_check.

3dpro:
	only ab-initio is slightly better.

	use human 1-4 as model 1-4
	ab model as model 5.

--------------------------------------------------------------------------

	***********************************************************
Weakness of model_check:

an extended helix that matches the secondary stx well can get a high score.

the stx is not compact, and usually the sov acc matching score is very low.

	We need to enlarge dataset to train?
	************************************************************

In this case, sa score is usually about 50%.
ss score can be high (if one ss dominates) or low.

We need to add this kind of examples to penalize it or we need to add a compactness
term.


	good selection criteria:

		predicted gdt-ts > 0.30
		sa matching score > 0.6
		ss matching score > 0.5

FOLD recognition using model_check evaluation:
	
	create a smaller library (25% identity)
	generate an alignment and a model for query using each template
	use model_check to evaluate the model and generate an evaluation score
	select model: with gdt > 0.3, sa > 0.6, ss > 0.5
	
	Then rank final models using gdt scores and return top ranked models
	
	penalize models with high gdt but low sa match score (<0.55). these models
	are usually just an extended string. usually we can halve the score.
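
The selection and penalty rules above can be sketched as follows. This assumes each model already carries its model_check outputs as plain numbers (predicted gdt, sa match, ss match); the dictionary keys and the function name are hypothetical. Note that with the sa > 0.6 filter switched on, the sa < 0.55 penalty can never fire, so it only matters when the filter is relaxed.

```python
def rank_models(models, gdt_min=0.30, sa_min=0.60, ss_min=0.50,
                sa_penalty_below=0.55, penalty=0.5, apply_filter=True):
    """Rank models by predicted gdt, optionally filtering and penalizing.

    models: list of dicts with keys "name", "gdt", "sa", "ss"
            (assumed to be taken from model_check output).
    """
    ranked = []
    for m in models:
        passes = m["gdt"] > gdt_min and m["sa"] > sa_min and m["ss"] > ss_min
        if apply_filter and not passes:
            continue
        score = m["gdt"]
        # Extended, non-compact models tend to have a low sa match: halve them.
        if m["sa"] < sa_penalty_below:
            score *= penalty
        ranked.append((score, m["name"]))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


candidates = [
    {"name": "a", "gdt": 0.45, "sa": 0.65, "ss": 0.60},
    {"name": "b", "gdt": 0.50, "sa": 0.50, "ss": 0.70},  # extended-helix-like
    {"name": "c", "gdt": 0.20, "sa": 0.70, "ss": 0.60},
]
strict = rank_models(candidates)
relaxed = rank_models(candidates, apply_filter=False)
```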
-------------------------------------------------------------------------


T0382: hard
target.
sspro: a helical protein.
the predictions of 3dpro and foldpro are not good.
the ab-initio model's score is very high. 

now submit T0382 to 3Djury.


3dpro:
	use model 8 (ab initio ) as model 1
	use human 1-4 as model 2-5

manually select 4 models from top 130 models generated by modeller
	as model 2-5.

foldpro:
	use manually selected 4 models on 3dpro.

now generating another 150 stx on foldpro (mine4)


T0381: easy
3dpro:
	use model 2 as model 1
	use model 1 as model 2
	use easy 1,3,4 as model 3-5

FR system: difficulty in 2-domain protein. only the N domain 3-helical bundle is identified. 
second domain is not right. the problem of global alignment.

domain 1: 1-87
domain 2: 88-end.

foldpro:
	keep model 1,2
	use easy 1,3,4 as model 3-5

more change:
use easy 1 as model 1
use current model 1 as model 3.

---------------------------------------------------------------------------------------------------------------

	************next CASP, must train a system specific for FR/A.***************

		****************************************************************************
	another idea:
		generating one model using modeller seems to be very fast.
		for a hard target, we can actually generate thousands of models, then use an energy function
		to evaluate them. it may only take one day. In this case, it is very likely we can identify
		good models?

	Actually, this is a new FR idea, especially for hard target (FR/A or some kind of NF)
		
		idea: 
			build a library without sequence identity (25%) (at most 3000 proteins)
			
			align query against the sequence in the template, and generate a 3D models using modeller

			use model check to evaluate, record the score


			pick a few templates with maximum scores.
		*******************************************************************************
	We can add this layer right after psi-blast. or add it after foldpro if time permits.
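
A rough sketch of the loop described in the idea above, with the alignment, model building and evaluation steps passed in as functions. Nothing here calls modeller or model_check directly; `align`, `build_model` and `evaluate` are placeholders for whatever wrappers exist around those tools.

```python
def recognize_fold(query, templates, align, build_model, evaluate, top_n=5):
    """Score every template by the quality of the model built from it.

    align(query, template) -> alignment or None if no alignment is possible
    build_model(alignment) -> model
    evaluate(model)        -> float, higher is better
    All three are hypothetical callables supplied by the caller.
    """
    scored = []
    for template in templates:
        alignment = align(query, template)
        if alignment is None:
            continue
        model = build_model(alignment)
        scored.append((evaluate(model), template))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Toy stand-ins, only to show the call shape.
fake_scores = {"t1": 0.2, "t2": 0.6, "t3": 0.4}
best = recognize_fold(
    "SEQ", ["t1", "bad", "t2", "t3"],
    align=lambda q, t: None if t == "bad" else (q, t),
    build_model=lambda alignment: alignment[1],
    evaluate=lambda model: fake_scores[model],
    top_n=2,
)
```

Since each template is processed independently, the loop could be split across machines without changing the logic.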

	paper:

			Protein Fold Recognition Using Model Quality Evaluation
----------------------------------------------------------------------------------------------------------------

			7/17/2006
The script of generating a stx from a specified template is done.
gen_stx_from_pdb_code.pl. 
the option file is in mine 3: /var/preserve/prosys/web/htdocs/T0372/gen_stx_option

inputs: option file, query_fasta_file, template id(pdb_code + chain id), output dir

	now we can use 3d-jury to help us to find templates for hard targets.

test on T0372 to use 1ne9a to generate a stx in /var/preserve/prosys/web/htdocs/T0372/human

use model_check to evaluate score:  0.3842301. The score is much better than our previous score, 
but still significantly lower than SP2 (62).  that means sp3 has better alignment than Lobster.
CE alignment between our human model and sp3 model
Rmsd = 3.1Å Z-Score = 6.7
Sequence identity = 29.5%
Aligned/gap positions = 258/67
------------------------------------------------------------------------------------------------------------------
T0380: easy
foldpro:
	replace model 5 with easy model 3.
	replace model 2 with model 7.
3dpro:
	use model 7 and model 8 as model 3, 4.
	use easy model 5 as model 5.


-----------------------------------------------------------------------------------
Post analysis

T0372 is a hard target, why most groups find 1ne9, but we didn't?????????????????????!!!!!!!!!!
		most people are using 3D-Jury to identify templates??????????????????????????????????

	use model check, score of sp3 based on 1ne9 is 0.62. looks like it is a correct template.

	next CASP, we need to use 3d-jury when no templates are identified? try to use 3d-jury, 
	1ne9 is ranked as no. 2.
	we need a script to generate a pir alignment given a pdb id using sequence in our own database.
	then we can generate a stx.


		*****************NEW IDEAS********************
	1ne9 is ranked about 100.
	Another way:
	for hard target, select top 200 templates, generate 200 models, 
	then run model check, select a few with highest scores.
	

	Add another layer to the hard target (FR/A):

	another trick is to rank templates using a few alignment scores individually (hhsearch, prc-hmm, compass,
	psi-blast, hmmer to make a internal meta prediction, lobster and so on????). 

	for T0372, our domain prediction is also wrong. actually, there is a domain boundary signal in the
	alignments.

	adjustment:
		for the last six targets, submit the targets to 3D-Jury too. 
		so if we see some difficulty, we can use 3D-Jury.


	new strategy for hard targets:
	use foldpro ranking (top 200)
	use 3djury ranking (we need a script to submit tasks to 3d-jury and parse output)
	use alignment ranking in foldpro
	then generate a model using a different templates,
	then use model check to evaluate and choose models.


		**************BIG FAILURE***********************************
			NEXT TIME, WE NEED TO USE 3D-JURY
			BAKER USE 3D-JURY, MANY GROUPS USE 3D-JURY.
			WE ALSO NEED TO BUILD OUR OWN 3D-JURY SYSTEM using
			EXISTING ALIGNMENT METHODS.
		************************************************************

	Another lesson:
		we need to use domain boundary signal explicitly encoded in sequence alignment.

	important lesson:	(******************VERY VERY VERY VERY IMPORTANT LESSON***********************)
		we need to avoid failure of FR. trick: use more rankings, select more templates
		evaluate by model_check, and use external FR (such as 3D-jury)


Key steps in next casp: meta-server
	1. external meta server
		once receiving a casp target, submit it to all other servers.
		results are sent to a special mail box.
		it can be checked by a human when the target is hard.
		or the server automatically process it.
		the model is then evaluated by model_check
		the external servers include "3D-Jury".
	2. internal meta server
		use alignment score to rank templates. select top from each alignment methods
		then generate models from top templates. then use model_check to evaluate.
Why do we need meta server?
	1. meta server can compete by itself. and we have model check to evaluate models.
	2. take model from meta server when foldpro encounters difficulty and may miss some templates.
		by doing this, we can avoid fatal failure of foldpro (like T0372, everybody get it right,
		but we didn't.)
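
For the internal meta server, one way to combine the per-method rankings is sketched below: take the top few templates from each alignment method and order the union by how many methods proposed each one. The vote-based ordering is an assumption added here, not something decided in the notes; the template ids are placeholders. The merged list would then go to model generation and model_check as described.

```python
def merge_top_templates(rankings, top_k=3):
    """Merge per-method template rankings into one candidate list.

    rankings: dict mapping method name -> list of template ids, best first.
    Returns template ids ordered by number of methods that picked them,
    then by their best rank, then alphabetically.
    """
    votes = {}
    for ranked in rankings.values():
        for rank, template in enumerate(ranked[:top_k]):
            count, best = votes.get(template, (0, rank))
            votes[template] = (count + 1, min(best, rank))
    return sorted(votes, key=lambda t: (-votes[t][0], votes[t][1], t))


merged = merge_top_templates(
    {
        "psi-blast": ["tA", "tD"],
        "hhsearch": ["tB", "tA"],
        "compass": ["tB", "tC"],
    },
    top_k=2,
)
```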
-----------------------------------------------------------------------------------------------------
	CASP 7 experience

Problem identified with foldpro:

good: templates can be identified.

bad: profile-profile alignment quality is not good, especially when involving multi-domain protein

combination is not good: a) alignment quality is not good b) don't remove inconsistent templates.

----------------------------------------------------------------------------------------------------
How to evaluate if a relative small region is a domain or just an extension of another domain?

1. length (> 50?)
2. fold independently (does it rely on other domain?)
3. structure or function unit?
4. alignment signal?
5. any mechanisms: domain/gene fusion if it is an insertion in the middle? at the two ends, is it just a dangling region, protrusion
region, or linker and so on?
6. if a region consists of a few insertions, it is probably not a domain, because a simple gene fusion
can't produce this kind of domain structure.
----------------------------------------------------------------------------------------------------


T0379: easy
foldpro:
keep cm 1, use easy 1-4 as model 2-5
domain: visual inspection: should be two domains. but domain parser classified it into one domain.
(we need to try other tools (Xu's tool), Sander's tool, or consulting CATH and SCOP database).
check templates:
1CQZ: single domain:  had-like
1VJ5: single domain: had-like
1S8O: single domain: had-like
2B0C: ? 

According to the T0379.align, there is domain cutting signal at about position 90. 
so let it be two domains.
	domain 1: 1-15, 87-end.
	domain 2: 16-84

Our algorithms: 
	look at alignment, domain boundary signal.
	check scop / cath domain definition, if consistent use.
	check scop domain definition, domain parser, Xu's domain parser, Sanders's domain parser
	take the majority vote.
	script to check to avoid the cutting in the middle of beta-sheet.
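
The majority-vote step in the list above could look like this. A minimal sketch, assuming each source (scop, cath, domain parser, and so on) has already been reduced to a predicted number of domains; the source names and numbers in the example are illustrative only.

```python
from collections import Counter


def majority_domain_count(predictions):
    """Return (domain_count, votes), or (None, votes) when the top is tied.

    predictions: dict mapping source name -> predicted number of domains.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions.values())
    (top, votes), *rest = counts.most_common()
    if rest and rest[0][1] == votes:
        return None, votes  # tie: needs manual inspection
    return top, votes


decision = majority_domain_count({
    "scop": 1,
    "domain parser": 1,
    "xu parser": 2,
    "sander parser": 2,
    "alignment signal": 2,
})
```

A tie is reported rather than resolved, since those are the cases that were inspected by hand anyway.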
3dpro:
keep model 1, use easy2-5 as model 2-5.

Further domain analysis using templates identified by foldpro:
1L7M: scop: HAD like domain, cath: rossmann fold: single domain
1Q92: same as above
1NNL: scop: had-like, cath: rossmann fold
1k1e: same
exception: 1RQL: scop: had like. cath: domain 1: rossmann fold, domain 2: orthogonal bundle
1ZRN: scop: had like, cath: two domains. 

			IMPORTANT NOTICES:
even for scop, it says the fold contains an insertion of a (sub)domain. for d1ZRN__
# Root: scop
# Class: Alpha and beta proteins (a/b) [51349]
Mainly parallel beta sheets (beta-alpha-beta units)
# Fold: HAD-like [56783]
3 layers: a/b/a; parallel beta-sheet of 6 strands, order 321456
# Superfamily: HAD-like [56784]
contains an insert alpha+beta subdomain; similar overall fold to the Cof family
usually contains an insertion (sub)domain after strand 1
# Family: L-2-Haloacid dehalogenase, HAD [56785]
the insertion subdomain is a 4-helical bundle
# Protein: L-2-Haloacid dehalogenase, HAD [56786]

That means proteins in the scop domain database can actually contain a small domain.
scop usually doesn't cut the closely coupled domains. The key issue is: do we treat it
as a domain or subdomain?


T0378: easy
foldpro:
keep model 1-3, use easy 1, 3 as model 4, 5.
3dpro:
keep model 1-3, use easy 1, 3 as model 4, 5

compare foldpro cm1 with 3dpro cm1
Rmsd = 0.9Å Z-Score = 7.6
Sequence identity = 92.7%
Aligned/gap positions = 246/10

compare 3dpro cm1 vs foldpro easy 1
Rmsd = 1.4Å Z-Score = 7.4
Sequence identity = 90.7%
Aligned/gap positions = 246/12

compare foldpro cm 1 vs 3dpro easy 1
Rmsd = 1.0Å Z-Score = 7.5
Sequence identity = 95.2%
Aligned/gap positions = 249/4

compare foldpro easy 1 with 3dpro easy 1
Rmsd = 1.0Å Z-Score = 7.4
Sequence identity = 91.5%
Aligned/gap positions = 247/12


T0376: easy
foldpro:

templates:
1HL2A, B: fold: tim beta/alpha barrel, super: aldolase, fam: class I aldolase. scop/cath: 1 domain.
1FDYA, B: same as above
1NAL1: same as above
1DHPA: same as above
1S5VA: same as above
1S5WA: same as above
2A6L: ?
1S5T: same as above
2A6N: ?
1XKY: ?
1O5K: same as above
1F6K: same as above
1W3I: same as above
So both cath and scop classify these templates into one domain. for this target, there are a few helices dangling
at the C terminus. but they may not be an independent domain in its own right. but the last region really looks
like an orthogonal helix bundle. these extended helices may be functionally important regions?
so final decision: single domain (unchanged so far).
keep model 1 and 2, use easy 1,3,5 as model 3,4,5.

3dpro:
use easy model 1-3 as models 3, 4, 5.


T0375: easy
foldpro:
keep model 1 and 2
use easy 1-3 as model 3-5.
domain is ambiguous:
1LIO
	scop: one domain
	cath: two domains: 3-layer(aba), 2-layer sandwich

	This protein, I somewhat prefer one domain. the extended small beta-sheet is too complex. 
	may not be a domain by itself.

scop description:
# Root: scop
# Class: Alpha and beta proteins (a/b) [51349]
Mainly parallel beta sheets (beta-alpha-beta units)
# Fold: Ribokinase-like [53612]
core: 3 layers: a/b/a; mixed beta-sheet of 8 strands, order 21345678, strand 7 is antiparallel to the rest
potential superfamily: members of this fold have similar functions but different ATP-binding sites
# Superfamily: Ribokinase-like [53613]
has extra strand located between strands 2 and 3
# Family: Ribokinase-like [53614]
# Protein: Adenosine kinase [53617]
# Species: Toxoplasma gondii [53619]

so scop consider the protruded strands as extra strand, not a (sub) domain.


1LIJ: same as 1LIO
2ABS
1DGM
1LII
2A9Y

domain decision:
	I would rather treat the beta-sheet as two long loops collide with each other to form a sheet.
	they may be just a function important site.

	this is also a problem of domain parser, which cuts in the middle of one beta-sheet (position 133).
	so, if we stop domain parser from cutting in the middle of a beta-sheet, this
	problem could be avoided. That is why we manually tweak the domain prediction.

3dpro:
	use easy1-4 as model 1-4, use current cm model 1 as model 5.

foldpro cm vs foldpro easy1:
Rmsd = 1.4Å Z-Score = 7.6
Sequence identity = 95.2%
Aligned/gap positions = 289/4

foldpro cm vs foldpro easy2
Rmsd = 1.8Å Z-Score = 7.6
Sequence identity = 93.8%
Aligned/gap positions = 290/8

foldpro easy1 vs easy 2
Rmsd = 2.1Å Z-Score = 7.5
Sequence identity = 94.4%
Aligned/gap positions = 288/8

3dpro cm vs foldpro cm
Rmsd = 2.1Å Z-Score = 7.5
Sequence identity = 93.7%
Aligned/gap positions = 287/16

3dpro cm vs 3dpro easy 1
Rmsd = 2.3Å Z-Score = 7.5
Sequence identity = 95.8%
Aligned/gap positions = 286/16

3dpro cm vs 3dpro easy 2
Rmsd = 2.2Å Z-Score = 7.5
Sequence identity = 95.8%
Aligned/gap positions = 285/16


foldpro cm vs 3dpro easy 1
Rmsd = 1.4Å Z-Score = 7.7
Sequence identity = 100.0%
Aligned/gap positions = 290/0

foldpro cm vs 3dpro easy 2
Rmsd = 1.0Å Z-Score = 7.8
Sequence identity = 100.0%
Aligned/gap positions = 295/0

3dpro easy 1 vs foldpro easy 1
Rmsd = 1.7Å Z-Score = 7.5
Sequence identity = 96.2%
Aligned/gap positions = 286/8


T0374: easy

3dpro:
keep model 1,2
use models 6, 7 as models 3,4
use easy1 as model 5. 

foldpro:
keep model 1
use 3dpro model 1 as model 2
keep model 3
use model 8 as model 4
use easy model 1 as model 5


T0373: easy


3dpro:
	keep model 1 
		
	use easy1-4 as model 2-5

	(very good model 1, it extends one cm template longer than the same template used by foldpro)
	that means longer psi-blast alignment is better. sometime extension manually is necessary.

FOLDpro:
	keep model 1
	use 3dpro model 1 as model 2.
	use easy1,2,5 as model  3,4,5.

compare the model 1 of 3dpro and foldpro:
Rmsd = 2.2Å Z-Score = 6.2
Sequence identity = 88.1%
Aligned/gap positions = 135/7


T0372: hard  (WE FAILED THIS TARGET, SHOULD CHECK 3D-JURY, A GOOD TEMPLATE IS FOUND: 1NE9.)
3Dpro:
use model 7 to replace model 2

foldpro:
replace model 4 with model 6. (too big distance between two residues in the model)
replace model 5 with model 1 of 3dpro (too big distance between two residues in the model)


T0371: easy

foldpro:
domain: 1-80/203-end, 81-202  (right now the range is set at 1-50)

	New ideas:
	FURTHER IMPROVEMENT of DOMAIN CUTTING Algorithms. The same beta-sheet must be put in the same domain.
	(need a program to adjust the domain cutting according to beta-sheet pairings for known structures)
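
A sketch of the check such a program would need: given a domain definition and the strand pairings of a known structure, report every pair of strands that the definition puts into different domains. The strand ranges in the example are invented; the domain ranges follow the definition above, with 250 standing in for "end".

```python
def domain_of(residue, domains):
    """Index of the domain containing residue, or None if unassigned."""
    for idx, segments in enumerate(domains):
        if any(start <= residue <= end for start, end in segments):
            return idx
    return None


def split_strand_pairs(domains, strand_pairs):
    """Return the paired strands that the domain definition separates.

    domains:      list of domains, each a list of (start, end) segments.
    strand_pairs: list of ((s1, e1), (s2, e2)) paired strands of one sheet.
    Only the first and last residue of each strand are checked.
    """
    broken = []
    for a, b in strand_pairs:
        hit = {domain_of(r, domains) for strand in (a, b) for r in strand}
        if len(hit) > 1:
            broken.append((a, b))
    return broken


domains = [[(1, 80), (203, 250)], [(81, 202)]]
pairs = [((70, 76), (205, 210)), ((78, 83), (90, 95))]
conflicts = split_strand_pairs(domains, pairs)
```

Each reported conflict marks a boundary that should be moved until both strands fall in the same domain.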

use easy model 1-4 as model 1-4
use current model 1 as model 5.

3dpro:
use easy model 1-4 as model 1-4
use current model 1 as model 5.


T0370: fr
3dpro:
	unchanged.

foldpro:
	?use model 1 of 3dpro as model 1 of foldpro.
	?the model 1 of foldpro includes too many templates, some of them are not very good. 

	**********final decision: model 1 is actually ok. so unchanged.************

T0369: hard
foldpro:
	use model 7 to replace model 1
3dpro:
	use model 4 as model 1, model 6 as model 4.


T0368: fr

foldpro:
	change it to one domain
	done.
	exchange model 1 and model 3.
	done.
3dpro:

	resubmit model 3 due to error. (1A17_)
	replace model 2 with model 6.
	resubmit model 5 due to error
	done.

	use human model to replace model 1. human model extends the cm template.
	in the current model 1, the last two helices are too far away.


T0367: fr

compare model 1 of 3dpro and foldpro:
Rmsd = 1.1Å Z-Score = 6.7
Sequence identity = 100.0%
Aligned/gap positions = 124/0


3Dpro: identify three positive templates
1	1UFBA	1.09 (reso: 1.9, fold: four-helical up/down bundle, super: Nucleotidyltransferase binding, fam: HEPN domain)
2	1WOLA	0.95 (reso: 1.62, HEPN protein)
3	1O3UA	0.88 (reso: 1.75, fam: same as 1UFB: HEPN domain)
So all three are consistent. frcom can be used as top.
No changes are necessary.

FOLDPRO:
1	1O3UA	1.52
2	1UFBA	1.43
3	1WOLA	1.31
Use model 8 as model 5. 


T0366: easy

the two cm models of foldpro and 3dpro are exactly the same.

FOLDpro:

keep cm as model 1, frcom as model 2
use easy 1, 2, 5 as model 3, 4, 5.


cm vs easy 1
Rmsd = 0.7Å Z-Score = 6.5
Sequence identity = 100.0%
Aligned/gap positions = 104/0

Rmsd = 0.6Å Z-Score = 6.3
Sequence identity = 100.0%
Aligned/gap positions = 101/0

Rmsd = 1.5Å Z-Score = 5.9
Sequence identity = 100.0%
Aligned/gap positions = 92/0

cm vs frcom
Rmsd = 2.3Å Z-Score = 5.9
Sequence identity = 100.0%
Aligned/gap positions = 93/0

cm vs cm of 3dpro
Rmsd = 0.0Å Z-Score = 6.6
Sequence identity = 100.0%
Aligned/gap positions = 106/0


3Dpro:
	keep cm model 1.
	use easy model 1,2,3,5 as model 2-5.

	cm vs easy1
Rmsd = 0.7Å Z-Score = 6.5
Sequence identity = 100.0%
Aligned/gap positions = 104/0

	cm vs easy2
Rmsd = 0.6Å Z-Score = 6.3
Sequence identity = 100.0%
Aligned/gap positions = 101/0

	cm vs easy 3
Rmsd = 0.4Å Z-Score = 6.3
Sequence identity = 100.0%
Aligned/gap positions = 96/0

	cm vs easy 5
Rmsd = 1.6Å Z-Score = 5.9
Sequence identity = 100.0%
Aligned/gap positions = 93/0

	easy 1 vs easy 2
Rmsd = 0.8Å Z-Score = 6.3
Sequence identity = 100.0%
Aligned/gap positions = 102/0

	easy 1 vs easy 3
Rmsd = 0.6Å Z-Score = 6.2
Sequence identity = 100.0%
Aligned/gap positions = 97/0

	easy 1 vs easy 5
Rmsd = 1.4Å Z-Score = 5.9
Sequence identity = 100.0%
Aligned/gap positions = 93/0

	easy 2 vs easy 3
Rmsd = 0.7Å Z-Score = 6.2
Sequence identity = 100.0%
Aligned/gap positions = 98/0

	easy 2 vs easy 5
Rmsd = 1.8Å Z-Score = 5.9
Sequence identity = 100.0%
Aligned/gap positions = 93/0

	easy 3 vs easy 5
Rmsd = 1.5Å Z-Score = 5.9
Sequence identity = 100.0%
Aligned/gap positions = 92/0


---------------------------------------------------------------------------------------
			**********FURTHER IMPROVE MODEL_CHECK**********
One weakness of model_check:
it gives a high score to an extended alpha-helix. that means it doesn't have a term
to favor compactness.  (one hard target model between 350-364....need to find it later)
So a further improvement is to add a compactness term such as radius of gyration?
and add other measures such as energy terms: verify3d, prosa, procheck, skolnick,
baker terms and so on????
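
A possible compactness term, sketched under assumptions: radius of gyration of the CA atoms, divided by a rough expectation for a compact globule of the same length. The expectation 2.2 * N^0.38 (in Å) is a commonly quoted rule of thumb and is an assumption here, not something taken from model_check; the straight-line "helix" in the example is a crude stand-in with a 1.5 Å rise per residue.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    sq = sum(sum((p[i] - centroid[i]) ** 2 for i in range(3)) for p in coords)
    return math.sqrt(sq / n)


def compactness_ratio(coords, a=2.2, b=0.38):
    """Rg divided by an assumed expectation a * N**b for a compact fold.

    Values well above 1 suggest an extended, non-compact model.
    """
    expected = a * len(coords) ** b
    return radius_of_gyration(coords) / expected


# One long extended helix of 100 residues, reduced to points on its axis.
extended = [(0.0, 0.0, 1.5 * i) for i in range(100)]
ratio = compactness_ratio(extended)
```

A ratio like this could be used to scale down the score of models that are far from compact, in the same spirit as halving the score for a low sa match.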

---------------------------------------------------------------------------------------
Interesting observation for this target: ****************************
	use BETApro/CMAPpro, no very long range contacts beyond 20 separation are predicted
	for this simple, helical protein (all of them are helix-loop-helix motifs). that means
	it is very difficult to predict long-range tertiary contact of this kind of simple
	topology protein (supposed to fold very fast according to B. Nolting). We may try to 
	design a special method to predict the helix orientation or helix contact for this 
	case. *****************************************************

T0365: fr
3Dpro: cm find one match 1T8B (evalue 0.005), not very significant. FR find three significant match
1	1T72A	1.28
2	1XWMA	1.06
3	1SUMB	1.02
4	1VCTA	-0.27
5	1I6ZA	-0.3
6	1HX1B	-0.33
Apparently, the cm model should not be put on the top in this case. (cmfr, cm_ab, frcom, fr1, fr2)
Actually, the cm template, 1T8B, is also a PhoU-like phosphate uptake regulator (Reso=3.23, R=0.216)

Decision:
	use model 3 (frcom) as model 1.
	use model 1 as model 2.
	keep model 4 and model 5. (fr1,fr2)
	use model 6 (fr3) as model 3.

FOLDpro: 
1	1SUMB	1.98  (scop: all alpha, fold: spectrin repeat like, super: Phou-like, fam: Phou-like) reso=2, r=0.22
2	1T72A	1.86 (PhoU homolog, reso=2.9, R=0.216)
3	1XWMA	1.32  (Phou phosphate uptake regulator, reso=2.5, r=0.243)
4	1VCTA	0.02  (potassium channel related)
5	1I6ZA	-0.46 (scop: mainly alpha, fold: spectrin repeat like, super: BAG domain) --- fold identification is correct
current models (frcom, fr1, fr2, fr3, fr4)

DECISION:
	keep model 1-4. use frcom model of 3dpro as model 5.


Compare frcom of foldpro with frcom of 3dpro:
Rmsd = 2.1Å Z-Score = 7.0
Sequence identity = 81.5%
Aligned/gap positions = 211/8


T0364: easy

FOLDpro:
use easy 1,2 as model 4, 5.

3dpro:
	use easy1-3 as model 3-5.
	compare model 1 with easy model 1,2 to see similarity.


T0363: FR

3Dpro:
use model 6 to replace model 5.
FOLDpro:
use model 1 of 3Dpro as one model 3 of FOLDpro.


T0362: easy
FOLDpro:
keep model 1 (cmfr) and 2 (cm).
use easy1-3 as model 3,4,5
3Dpro:
keep model 1 and model 2
use easy1-3 as model 3, 4, 5.


T0361: hard
3Dpro:
	replace model 5 with model 7.
FOLDpro:
	unchanged.


T0360: FR
3Dpro:
	replace model 5 with model 7 (ab-initio)
FOLDpro: unchanged.

T0359: easy
FOLDPRO:
use easy 1,2,3,4, as model 2, 3, 4, 5. keep original model 1(cm)
3Dpro:
use easy 1,2,3,4, as model 2, 3, 4, 5. keep original model 1(cm)


T0358: hard
3dpro: unchanged.
foldpro:
	use model 2, 1, 6 as model 1, 2, 3


T0357: hard
FOLDpro:
replace model 2 with model 6
replace model 4 with model 7
3Dpro:
use model 1 of foldpro as model 1
use model 1 of 3dpro to replace model 4.

-------------------------------------------------------------------------------------------------------------
POST ANALYSIS:	 (weakness of using multiple templates when there are very significant match)

T0310: AN EXAMPLE WHERE WE USE A LOT OF TEMPLATES, BUT THE BEST SERVER USES ONLY ONE TEMPLATE. OUR SCORE IS SLIGHTLY
LOWER THAN OTHERS ACCORDING TO CE ALIGNMENT. the evalue of the first template: 1O20A is e-152, coverage is also very high. it should
be used alone instead of using a lot of templates.
	This problem should not happen in next CASP since we now put a very significant cm match on top
	instead of combining them.
T0289:
	an example that CM only find a portion of match. In this case, we either use significant FR templates (full length)
	as FOLDpro did.
	or try to add some FR into cm alignments to cover the full sequence.

In any case, we must combine templates or extend existing CM templates to cover the whole sequence.
-------------------------------------------------------------------------------------------------------------
Lesson learned:  (A POSSIBLE NEW APPROACH FOR PROTEIN DOMAIN PREDICTION, PARTICULARLY FOR AB-INITIO)
AB-initio
T0347: according to secondary stx, the first part is alpha/beta, the second part is alpha.
So this should be classified into two domains (most methods classify this into two domains).
our DOMpro also has a signal to classify it into two domains. 
FOR AB-INITIO in future:

	a) check DOMPRO. must trust DOMpro because it is the best ab-initio (especially when prot size > 200)

	*******************************************************************************************
	b) check secondary stx prediction to see secondary stx patterns (types)
	********************************************************************************************

	c) refer to DOMSSEA as well (additional evidence)

	d) size > 200, it is possible to have two domains. size < 130, usually should be one domain.

	e) for template with svm score > -0.5, also need to consult FR.

	***************************************************************************************
	f) check alignment file. (***************VERY VERY VERY USEFUL**********************)
	****************************************************************************************

	Actually, for this one FOLDpro finds 1VK1 (svm=-0.25) which is used by many other servers. according to
	this target, domain should be 2. we should have stuck to this one. Of course the domain architecture
	can be adjusted.
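
Step b) above (checking the secondary stx pattern) can be partly automated. A sketch, assuming a predicted three-state string with H, E and C: classify the part before and after each candidate cut and report cuts where the class changes, as with the alpha/beta then alpha pattern in T0347. The 15% threshold, window sizes and the toy string are assumptions chosen for illustration.

```python
def ss_class(ss, min_frac=0.15):
    """Coarse class of a secondary structure string (H, E, C)."""
    n = max(len(ss), 1)
    has_h = ss.count("H") / n >= min_frac
    has_e = ss.count("E") / n >= min_frac
    if has_h and has_e:
        return "alpha/beta"
    if has_h:
        return "alpha"
    if has_e:
        return "beta"
    return "coil"


def candidate_cuts(ss, min_len=50, step=10):
    """Cut positions where the left and right parts fall in different classes."""
    cuts = []
    for cut in range(min_len, len(ss) - min_len + 1, step):
        left, right = ss_class(ss[:cut]), ss_class(ss[cut:])
        if left != right:
            cuts.append((cut, left, right))
    return cuts


# Toy prediction: an alpha/beta region followed by an all-alpha region.
toy = "HHHHHHHHCCEEEEECC" * 6 + "HHHHHHHHHHCC" * 9
cuts = candidate_cuts(toy)
```

The output is only a hint: several neighbouring cuts may be reported, and the final boundary still has to be checked against DOMpro and the alignment file.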
-------------------------------------------------------------------------------------------------------------
T0356: hard
large hard protein
at least it has two domains (maybe three domains------------VERY POSSIBLE! third domain from 347 - end?)
take the front and end parts and submit them to FOLDpro public.
Later, we may combine them with the first template of 3Dpro.

FOLDPRO:
model 7, 1, 3, 5, 6 as model 1,2,3,4,5
3Dpro:
models human, 7, and 3 of FOLDpro and models 7,8 of 3Dpro as model 1,2,3,4,5.

DECISION:
DOMAIN IS SET TO 3 by checking secondary stx pattern and sequence alignment file.
---------------------------------------------------------------------------------------------
T0355: hard FR (domain hard)

FINAL DECISION:
select human1, human2, human3, human4, human6 for both FOLDpro and 3Dpro.
(to distinguish them, change model order of model 2 and model 3 for FOLDpro and 3Dpro)
change domain to single domain.

psi-blast only finds c-terminal region (1/3 of total length)
Strategy:

FOLDPRO:

1. submit the first 2/3 of sequence to public FOLDpro to see if we can 
identify some significant match. then use it to combine with the first half later.
	running, but not necessary anymore.

2. generate a stx using all possible alignments in pir, particularly long templates

3Dpro:
model 4 (1KA9F) and 5 (1THFD) are better because they use long templates.
according to sspro (alternating helix and strand) and model4/5, it is a beta-barrel.
now life is easier.
model 7 is also ok. model 8 also tries to make a barrel, but front region is not well covered.
model 6 also try to make a barrel, but front region is not well aligned.
model 3(frcom): try to make a barrel, but too many conflicts
model 1: half barrel.


PROBLEMS:
psi-blast alignment is too short even though maybe the whole template can be used

advanced combination tends to select short fragments from many templates. should select 
long fragments. so templates should be ranked also by alignment coverage, not just svm scores.
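
Ranking by coverage as well as svm score could be sketched like this. It assumes gapped alignment strings with '-' for gaps; the combined score (svm score plus a weighted coverage) and the weight are placeholders, since the notes do not say how the two should be combined.

```python
def alignment_coverage(aligned_query, aligned_template):
    """Fraction of query residues aligned to a template residue."""
    query_len = sum(1 for q in aligned_query if q != "-")
    covered = sum(
        1 for q, t in zip(aligned_query, aligned_template)
        if q != "-" and t != "-"
    )
    return covered / query_len if query_len else 0.0


def rank_by_score_and_coverage(hits, weight=1.0):
    """Sort hits by svm_score + weight * coverage (an assumed combination).

    hits: list of (template_id, svm_score, coverage).
    """
    return sorted(hits, key=lambda h: h[1] + weight * h[2], reverse=True)


cov = alignment_coverage("AC-DE", "ACGD-")
order = rank_by_score_and_coverage([("short", 1.5, 0.3), ("long", 1.2, 0.9)])
```

With a weight of 1.0 the long template moves ahead of the short one even though its svm score is lower.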


CHECK FOLDpro (fr models)
frcom: too many conflicts.
model 4-7: try to make barrels, but alignment is not good
model 8 is better. (1QOPA)


Analyze the following 12 templates:
1KA9F: tim beta/alpha barrel. (Histidine biosynthesis enzyme family). reso=2.5
		model 4 (already generated)
		(human 2.pdb....)


1THFD: reso=1.45 (this one should be used), tim beta/alpha barrel. (Histidine biosynthesis enzyme family)
		model 5 (already generated)
		first model???? (human1.pdb....)

1QOP: reso=1.4, fold: tim beta/alpha-barrel, Ribulose-phosphate binding barrel (super), tryptophan biosynthesis enzyme family
		human4.pir

1H5Y, reso=2.0, tim barrel, family: histidine biosynthesis enzyme
	human3.pir

1O5KA: tim alpha/beta barrel, super: aldolase, fam: Class I aldolase
	human5.pir

1RD5: reso = 2.02, tryptophan synthase (no scop definition yet)
	human6.pir

1P4C, reso=1.35, fold: tim beta/alpha barrel, super: FMN-linked oxidoreductase, fam: FMN-linked oxidoreductase
	human7.pir

1F6K, reso=1.6, fold: tim beta/alpha barrel, super: aldolase, fam: class I aldolase
	human8.pir

1vhn: reso=1.59, fold: tim beta/alpha barrel, super: FMN-linked oxidoreductase, fam: FMN-linked oxidoreductase
	human9.pir

1QO2, reso=1.85, tim beta/alpha barrel. (Histidine biosynthesis enzyme family)
	humana.pir

1HL2: reso=1.8, 1F6K, reso=1.6, fold: tim beta/alpha barrel, super: aldolase, fam: class I aldolase
	humanb.pir

1GOX: reso=2, 1vhn: reso=1.59, fold: tim beta/alpha barrel, super: FMN-linked oxidoreductase, fam: FMN-linked oxidoreductase
	humanc.pir
	

T0354: hard
FOLDpro:
replace model 2 with model 6.
3Dpro:
use model 8 as model 1
use model 7 as model 5

T0353: hard
3Dpro:  MODEL 1 IS THE SAME AS FOLDPRO. leave it.
FOLDpro: DECISION: NO CLEAR PREFERENCE. LEAVE IT.
model 1 vs model 2
Rmsd = 3.1Å Z-Score = 4.4
Sequence identity = 37.7%
Aligned/gap positions = 69/4

model 1 vs model 3
Rmsd = 2.9Å Z-Score = 4.6
Sequence identity = 39.2%
Aligned/gap positions = 74/8

model 1 vs model 4
Rmsd = 2.5Å Z-Score = 4.1
Sequence identity = 21.1%
Aligned/gap positions = 71/16

model 1 vs model 5
Rmsd = 3.8Å Z-Score = 3.9
Sequence identity = 39.4%
Aligned/gap positions = 71/18

model 5 vs model 2
Rmsd = 2.2Å Z-Score = 4.9
Sequence identity = 100.0%
Aligned/gap positions = 70/0

model 5 vs model 3
Rmsd = 5.3Å Z-Score = 4.6
Sequence identity = 93.6%
Aligned/gap positions = 78/6

model 5 vs model 4
Rmsd = 2.7Å Z-Score = 4.6
Sequence identity = 100.0%
Aligned/gap positions = 76/0

model 2 vs model 3
Rmsd = 2.1Å Z-Score = 5.0
Sequence identity = 100.0%
Aligned/gap positions = 70/0

model 2 vs model 4
Rmsd = 1.7Å Z-Score = 4.9
Sequence identity = 82.6%
Aligned/gap positions = 69/2

model 3 vs model 4
Rmsd = 4.7Å Z-Score = 4.7
Sequence identity = 86.6%
Aligned/gap positions = 82/2

so all five models are similar; models 2, 3, 4 and 5 are especially close to each other.
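
The pairwise CE summaries above can be tabulated mechanically instead of read by eye. A minimal sketch in Python, assuming the CE output keeps the three-line layout pasted in these notes; parse_ce_block is a hypothetical helper name, not part of CE or of the FOLDpro/3Dpro pipeline.

```python
import re

# Three-line CE summary as pasted in these notes (layout is an assumption).
CE_PATTERN = re.compile(
    r"Rmsd\s*=\s*([\d.]+)Å\s+Z-Score\s*=\s*([\d.]+)\s+"
    r"Sequence identity\s*=\s*([\d.]+)%\s+"
    r"Aligned/gap positions\s*=\s*(\d+)/(\d+)"
)

def parse_ce_block(text):
    """Extract rmsd, z-score, identity (%), aligned and gap counts from one CE summary."""
    m = CE_PATTERN.search(text)
    if m is None:
        raise ValueError("no CE summary found")
    return {
        "rmsd": float(m.group(1)),
        "z": float(m.group(2)),
        "identity": float(m.group(3)),
        "aligned": int(m.group(4)),
        "gaps": int(m.group(5)),
    }

# The "model 1 vs model 2" block of T0353, copied from above.
example = """Rmsd = 3.1Å Z-Score = 4.4
Sequence identity = 37.7%
Aligned/gap positions = 69/4"""
result = parse_ce_block(example)
print(result)
```

With every pair parsed this way, the records can be sorted by rmsd or z-score to see which models cluster together.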

-------------------------------------------------------------------------------------------------------
			EVALUATION OF T0313
Using a lot of templates in very significant case is not always bad:
T0313: 3Dpro score is 79, very good. FOLDpro is 80 (highest so far).
	SO SOMETIMES, WE STILL NEED TO PUT MULTIPLE TEMPLATE CM ON THE TOP
	USING CE COMPARISON (COMPARE CM-MULTI AGAINST EASIEST, CM1, CM2...)
FOR THE LAST ONE THIRD OF COMPETITION, WE MUST BE CAREFUL TO DECIDE WHEN TO USE MULTI-TEMPLATE,
WHEN NOT (JUDGE CASE BY CASE, CAREFULLY DO STX COMPARISON)
--------------------------------------------------------------------------------------------------------

T0352: hard
3Dpro: no change
FOLDpro: no model is good. use model 6 to replace model 1 according to visual inspection and verify 3d score. model_check score
of model 6 is 5 points lower (18)
use model 1 to replace model 5


T0351: hard
3Dpro:
	replace model 1 using model 6 (ab initio, because it is a short protein)
	replace model 3 using model 1
	replace model 4 using model 8
FOLDPRO:
	try to get rid of model 2 and 4


	ab-initio predict two domains. but the protein is very small????

	compare model 1 and 7:
	Rmsd = 3.5Å Z-Score = 4.1
	Sequence identity = 91.1%
	Aligned/gap positions = 79/34
	compare model 1 and 3:
	Rmsd = 4.6Å Z-Score = 4.7
Sequence identity = 84.5%
Aligned/gap positions = 97/5
	compare model 1 and 5
	Rmsd = 4.1Å Z-Score = 4.4
Sequence identity = 81.3%
Aligned/gap positions = 91/29

	compare model 1 and 2
	Rmsd = 7.6Å Z-Score = 2.0
Sequence identity = 23.4%
Aligned/gap positions = 64/52

	compare model 3 and 5
	Rmsd = 4.8Å Z-Score = 3.9
Sequence identity = 55.7%
Aligned/gap positions = 88/30

	compare model 3 and 7
	Rmsd = 6.4Å Z-Score = 3.3
Sequence identity = 68.8%
Aligned/gap positions = 80/47

	compare model 5 and 7
Rmsd = 4.2Å Z-Score = 4.4
Sequence identity = 95.2%
Aligned/gap positions = 84/15


	******FINAL DECISION********
	use model 6 as model 2
	use model 7 as model 4
	
	domain: single domain due to the length limit, unless strong evidence favors two domains.
	(<130, single domain)
	****************STILL NEED TO THINK THIS CASE HARD, FR MODEL ALSO PREDICT TWO DOMAINS, BUT
	CAN 30-40 RESIDUES BE TREATED AS ONE DOMAIN?????????????????


T0350: hard
no change to make.

---------------------------------------------------------------------------------------------------------------
T0349: hard
3Dpro:
	run 1: model 5 is very bad. use model 6 to replace model 5. wait for the second run to finish.


	replace model 4 with the model 1 of foldpro. done.

FOLDpro:
	both run finished.
	model 6 looks pretty good.
	use model 6 as model 1
	use model 1 as model 2
	also need to use 3Dpro (model 2 by 1ZHV: different alignment using same template
	can generate better models!!!!)
	use model 2 of 3Dpro as model 4


T0348: easy ?
3Dpro:
find one match: 1PFT which is also a short protein.
the generated stx is not very compact though.
SCOP: SMALL PROTEIN, ZINC BETA-RIBBON, transcriptional factor domain
CATH: mainly beta, single sheet.
It is an NMR stx.


T0347: hard target
FOLDpro: 
domain is classified into two domains by fr. Should we use ab-initio one domain prediction because the
FR model is not good?  (POST THINKING AFTER PREDICTIONS ARE PUBLISHED ON THE WEB, svm score =-0.25, should at least use FR as guide)
check dompro raw file, it does predict a "TTT" at position 137 and 162 respectively.
DECISION:
	USE DOMPRO SINGLE DOMAIN PREDICTION because FR prediction is not good at all.
---------------------------------------------------------------------------------------------------------------

T0346: easy target
3Dpro:
use easy models 1-3 as model 3,4,5. keep original model 1 and model 2
FOLDpro:
use easiest as model 1

FOLDPRO:
use easiest as model 1
cm model as model 3
easy model 1-2 as model 4-5

compare model with easiest:
Rmsd = 0.9Å Z-Score = 7.1
Sequence identity = 100.0%
Aligned/gap positions = 172/0
compare easy 1 with easiest:
Rmsd = 1.3Å Z-Score = 6.6
Sequence identity = 99.4%
Aligned/gap positions = 164/16
compare easy 2 with easiest:
Rmsd = 1.3Å Z-Score = 6.6
Sequence identity = 99.4%
Aligned/gap positions = 164/16
compare model 2 with easiest:
Rmsd = 0.8Å Z-Score = 6.9
Sequence identity = 95.2%
Aligned/gap positions = 166/5


T0345: easy target
3Dpro: use easy 1-4 as model 1-4. use original model 1 as model 5.
compare easy model 1 with cm model 1:
Rmsd = 0.2Å Z-Score = 7.3
Sequence identity = 100.0%
Aligned/gap positions = 182/0
FOLDpro: use easy 1-4 as model 1-4. use original model 1 as model 5.

T0344: hard target.
3Dpro: use model 6 to replace model 2
FOLDpro: exchange model 1 and model 2
resubmit model 5 due to model loss.

--------------------------------------------------------------------------------------------------------------
T0343: hard target
(alpha and beta protein, sheet is buried).
3Dpro:
use model 4 as model 1, use model 7 as model 4, use model 8 as model 5.
FOLDpro:
Exchange model 3 and 1. use model 7 as model 2.


T0342: hard target  (NEED TO VERIFY LATER IF IT SHOULD BE CUT INTO TWO DOMAINS. 50 RESIDUES OF ALPHA HELIX??)
FOLDpro:
one template, 2G0QA, is found, but it only covers one fragment.
cm only finds one domain. so domain combination is hard.
maybe use fr or use the full length of original template?
according to the secondary stx prediction of the last 50 residues,
the template has strand, loop, one long helix.
the target has three helices. so the second part of the template
doesn't match significantly with the target?

1)now make a human alignment to use the full length of the sequence.
human model is done. predicted gdt score: 44.
2)submit second part to public foldpro server.

3Dpro:
cm found nothing.
but fr found the following templates:
1	1VKBA	1.08
2	2G0QA	1.04 (same as found by cm of foldpro)
3	1XHSA	0.99
4	1V30A	0.81

DECISION:
FOLDPRO:
	USE MODEL 1 TO REPLACE MODEL 2
	USE MODEL 7 TO REPLACE MODEL 1.
	DOMAIN: undecided yet. (leave it alone? one domain?)
	compare model 1 and model 7:
	
	Rmsd = 1.7Å Z-Score = 5.3
	Sequence identity = 77.0%
	Aligned/gap positions = 100/30
	The last 50 residues are not well aligned as we expected.
3Dpro:
	compare 3Dpro model 1 with the model 1 of foldpro
	Rmsd = 2.7Å Z-Score = 5.3
	Sequence identity = 79.4%
	Aligned/gap positions = 107/18

	compare 3dpro model 1 with model 7 of foldpro
	Rmsd = 2.9Å Z-Score = 6.0
	Sequence identity = 91.6%
	Aligned/gap positions = 131/16


T0341: easy
3Dpro:
use easy model 1,2,3 as model 3,4,5.
FOLDPRO:
the tail dangling region is assigned to domain 2 (should be domain 1; a bug in the parsing perl script, to fix).
the first segment of domain 1 is too short. manually adjust it.
use easy model 1,2,3 as model 3,4,5.
compare cmfr model with easy model 1:
Rmsd = 0.6Å Z-Score = 7.6
Sequence identity = 99.2%
Aligned/gap positions = 245/4


T0340: easy
FOLDpro:
use easy model 1-4 as model 1-4.
use original model 1 as model 5.
3Dpro:
use easy models 1-4 as model 2-5.


T0339: easy target

FOLDPRO:
domain is hard to determine (2 or 3 domains??)
CATH: 2 DOMAIN. SCOP: 1 domain. 
So let's do two domain prediction. 
domain 1: 1-13/288-end. domain 2: other.
use easy model 1-4 as model 1-4
use original model 1 as model 5
3DPRO:
use easy model1 to 4 as model 2-5.


T0338: easy target
3Dpro templates: 
	1OKVB: two domains (two orthogonal bundles)
	1AIS: two domains (two orthogonal bundles)
So this protein is two-domain protein.
the front end and back end may be disordered. 
PDP parse non-continuous domains and classify residues at front and end
into other domains? how we handle this???
The real bug is in our parsing scripts:
pdb classify two domains: 146-246 and 22-145. so we assign domain
1 to 146-246, domain 2 to 22-145. but the front end is assigned domain 1.
that is why we see crossing. 
If domain parser first order the domain segments, this problem can be avoided.
************************VERY IMPORTANT*********************************************
THIS IS A BUG IN parse_domain.pl. WE NEED TO FIX THIS BUG LATER.
FOR NOW, WE NEED TO MANUALLY VERIFY DOMAIN PARSING USING PDP PROGRAM AND
FIX PROBLEMS IF BUG HAPPENS. I THINK THIS PROBLEM HAS HAPPENED BEFORE,
AT THAT TIME WE DIDN'T KNOW THE REASON AND DIDN'T IDENTIFY THE PROBLEM.
************************************************************************************
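
The fix described above (order the domain segments before numbering them) can be sketched as follows. This is a Python illustration only, not the actual parse_domain.pl code; order_and_extend is a hypothetical name, the target length of 260 is made up for the example, and attaching each uncovered end to the sequence-adjacent domain is an assumption that fits the T0338 case but not discontinuous domains such as T0341.

```python
def order_and_extend(domains, length):
    """Renumber domains by first residue and attach uncovered termini.

    domains: list of domains, each a list of (start, end) segments,
             1-based and inclusive, as a domain parser might report them.
    length:  target length in residues.
    """
    # Sort segments inside each domain, then sort domains by first residue,
    # so that "domain 1" is always the one that starts first in sequence.
    ordered = sorted((sorted(d) for d in domains), key=lambda d: d[0][0])
    # Domain holding the lowest residue gets the front end,
    # domain holding the highest residue gets the back end.
    first = min(ordered, key=lambda d: d[0][0])
    last = max(ordered, key=lambda d: d[-1][1])
    first[0] = (1, first[0][1])
    last[-1] = (last[-1][0], length)
    return ordered

# Parser output in the order seen for T0338: 146-246 first, then 22-145.
fixed = order_and_extend([[(146, 246)], [(22, 145)]], 260)
print(fixed)
```

After reordering, the front end lands in domain 1 and the back end in domain 2, which is the manual correction made for FOLDpro on this target.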

DECISION:
3DPRO: REPLACE MODEL 4 WITH EASY MODEL 2.

FOLDPRO: 
change domain model to let front/back ends belong to domain 1 and 2 respectively.
use easy model 2 of 3dpro to replace model 4 of foldpro.


------------------------------------------------------------------------------------------
T0337: easy  (CANCELED, EARLY RELEASE)
3Dpro:
2 domains (alpha domain and a+b domain)
anyway: replace model 3, 4, 5 with easy models 1, 2, 3.   (to do)


T0336: hard target (CANCELED, EARLY RELEASE)
3Dpro:
find several significant matches with similar folds.
1OYZ: one domain (both CATH AND SCOP), repeat alpha.
1Q1S: same
1EE4A: same
2BPT: not classified yet, but similar
1XM9: armadillo repeat domain
1TE4: same
1JDH: same

foldpro:
replace model 2 with model 7 (to do)
the top template: 1W63Z is not really good. 
it is lower resolution and in different fold according to
SCOP.
We should remove it to regenerate model????
compare current model 1 of foldpro and 3dpro:
Rmsd = 3.7Å Z-Score = 6.6
Sequence identity = 91.1%
Aligned/gap positions = 214/32

remove 1W632 to regenerate a model and compare it to 
the model 1 of 3dpro. 
a new model is generated: model check score: 0.37, slightly lower than model 1 (40).
compare it to 3dpro model 1:
Rmsd = 3.4Å Z-Score = 6.6
Sequence identity = 82.2%
Aligned/gap positions = 213/48

DECISION: 
USE HUMAN MODEL TO REPLACE BAD MODEL 2 and LEAVE MODEL 1 AS IS.
------------------------------------------------------------------------------------
T0335: hard target (pretty hard)
3Dpro:
replace model 5 with model 8

Foldpro:
exchange model 1 and model 4
replace model 3 with model 7
resubmit model 5 due to error.
---------------------------------------------------------------------------------------------
T0334
psi-blast found two very significant templates:
2AQJ and 2ARD. evalue of both is 0. 2AQJ has better resolution (1.8 vs. 2.6), 2AQJ has higher identity rate (0.54 vs. 0.53),
higher positive rate (0.73 vs. 0.7), lower gap rate (0.02 vs. 0.05). to check 
the structure of both. We should use 2AQJ as model 1. Since gap is very small,
we should not use cmfr.
Model 1: 2AQJ
Model 2: 2ARD
Model 3: combine of them
model 4: others.
We probably don't need to use FR at all. 
Use CE to compare two templates:
Rmsd = 0.6Å Z-Score = 8.1
Sequence identity = 100.0%
Aligned/gap positions = 517/0
They are almost exactly same. So, just use 2AQJ as template for model 1. 
2ARD has one more small gap (probably due to  disorder or high b-value?).

According to visual inspection, it looks like a a+b single domain protein.
pdp and pdb also classify the protein into one domain.

3Dpro:
easiest is put on the top. 
compare model 1 (easiest) and model 2 (cmfr):
Rmsd = 0.4Å Z-Score = 8.4
Sequence identity = 99.8%
Aligned/gap positions = 517/16
compare model 1 with model 3( cm)
Rmsd = 0.4Å Z-Score = 8.4
Sequence identity = 99.8%
Aligned/gap positions = 518/14
compare model 1 with easy 1:
Rmsd = 0.6Å Z-Score = 8.4
Sequence identity = 99.8%
Aligned/gap positions = 519/12
DECISION:
	Replace model 4,5 with easy model 1 and 2. 
FOLDPRO:
Replace model 4, 5 with easy model 1 and 2. 
********************************************************************************************
			A NEW START (SINCE JUNE 16, 2006, SECOND HALF OF CASP7)
********************************************************************************************
VERY HARD RULE: for template with evalue < -100 or -120 or -150, identity rate > 0.5, the
top 1 template should be used only as long as it has good resolution.

				HALF CASP MILESTONE

			NEW FEATURES (HALF CASP IS GONE)
June 16, 2006.
Now I have implemented the protein stx modeling based on CM model using only single
top ranked templates and fragments supplemented by other templates if possible.
So, in future, we will use these models to replace many bad models in 
regular generation.
Especially, for very significant templates (e < -90 and cover > 0.9 or 0.85), we should use the top
single template as the best model.

I also adjust the cm and fr options (combinations) to reduce max linker size to 3 or 5 for both foldpro
and 3dpro. (effective since target T0334)

I also adjust the e-value difference for significant combination to 5 from 10. Thus we are going to 
combine less, but more close templates in future.
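
The tighter combination window can be illustrated with a small Python sketch. It assumes e-values are stored as the exponents written in these notes (e.g. -153) and that "difference" means the distance from the best hit; templates_to_combine is a hypothetical name, not a function of the pipeline.

```python
def templates_to_combine(hits, max_diff=5):
    """Keep templates whose e-value exponent is within max_diff of the best hit.

    hits: list of (template_name, evalue_exponent); more negative is better.
    """
    if not hits:
        return []
    best = min(exp for _, exp in hits)
    return [name for name, exp in hits if exp - best <= max_diff]

# T0291 numbers from these notes: 1JPAA at -153, 2SRC at -143.
hits = [("1JPAA", -153), ("2SRC", -143)]
print(templates_to_combine(hits))               # new window of 5
print(templates_to_combine(hits, max_diff=10))  # old window of 10
```

With the window at 5, the second template drops out of the combination; with the old value of 10 it would still be included.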

We still need to be very careful about dangling region (>25 residues). we need to 
either extend the alignments on the same template or drag fragments from other 
templates. (so we still need some human intervention if necessary)

another lesson:
	the model check score is good at discriminating good models from bad models. (score diff > 15)
	but it is hard to discriminate best models from good models ( score < 10)
	So model check is still reliable, but don't expect to rank the best models on the top always.
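
A small Python sketch of how the two thresholds could be applied when comparing two models. The function name and the "weak" label for differences between 10 and 15 are assumptions; the notes only state the two ends of the range.

```python
def model_check_verdict(score_a, score_b, clear_gap=15, noise_gap=10):
    """Say which of two models the model check score supports, if any."""
    diff = score_a - score_b
    if abs(diff) > clear_gap:
        # Large gap: the score separates a good model from a bad one.
        return "a" if diff > 0 else "b"
    if abs(diff) < noise_gap:
        # Small gap: both may be good, the score cannot pick the best.
        return "undecided"
    # In-between range is not covered by the notes; treated as a weak hint.
    return "weak"

print(model_check_verdict(44, 19))
print(model_check_verdict(40, 37))
```

An "undecided" result is where e-value, template resolution and visual inspection should take over.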
Lessons:
Use more templates (when there are very significant match), is not always good (T0291)
Have a large dangling region (due to less complete local alignment of psi-blast), it can
cost a lot of gdt-ts scores. (T0293)


For ranking now, we need some human intervention. At the same time refer to the
model check scores, e-value, svm-score, template resolution, visual inspection. 

T0291:
blast info:
temp_name, length, score, evalue, align length, identity rate, positive rate, gap rate
top 2 (two chains of same protein): 1JPAA: evalue is -153, cover rate: 0.89, identity rate: 0.74
no. 3: 2SRC, evalue: -143, align length=285, identity rate=0.42.
The difference of evalue and identity rate is very large. so we should not use no. 3 and below.

	*********************************************************************
IMPORTANT IDEA:
TO DO THE BEST IN THE EASY COMPARATIVE MODELING, WE NEED TO ADD ONE EXTRA LAYER TO THE 
PIPELINE. WE USE BLASTP TO BLAST DATABASE WITHOUT PROFILES TO IDENTIFY VERY EASY TEMPLATES.

IF THE COVERAGE > 0.85 AND WITHOUT VERY BIG GAPS (>20 RESIDUES) AND RESOLUTION < 2.5 AND EVALUE < -90,
WE GENERATE A MODEL AND THE MODEL SHOULD BE PUT ON THE TOP. THIS MODEL IS USUALLY THE BEST MODEL. 
	THUS, OUR PIPELINE WILL HAVE FOUR LAYERS: BLAST, PSI-BLAST, FOLDPRO, AB-INITIO.
	**********************************************************************
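
The acceptance rule for the new BLAST layer can be written down as a simple filter. This is a Python sketch under stated assumptions: BlastHit and is_easy_template are hypothetical names, e-values are the exponents used in these notes, and the max_gap values in the examples are invented because the notes do not record them.

```python
from dataclasses import dataclass

@dataclass
class BlastHit:
    name: str
    coverage: float    # aligned fraction of the target
    max_gap: int       # longest alignment gap, in residues
    resolution: float  # template resolution in angstroms
    evalue_exp: float  # e-value exponent as written in these notes

def is_easy_template(hit, min_cov=0.85, gap_limit=20, max_res=2.5, max_eval=-90):
    """True if the hit passes all four cutoffs of the rule above."""
    return (hit.coverage > min_cov
            and hit.max_gap <= gap_limit
            and hit.resolution < max_res
            and hit.evalue_exp < max_eval)

# 1JPA figures for T0291 come from these notes; max_gap=0 is a placeholder.
jpa = BlastHit("1JPA", coverage=0.87, max_gap=0, resolution=1.91, evalue_exp=-134)
# A hit with an exponent of only -82 (as for T0290); other fields are invented.
weak = BlastHit("hypothetical", coverage=0.90, max_gap=0, resolution=2.0, evalue_exp=-82)
print(is_easy_template(jpa), is_easy_template(weak))
```

The cutoffs are keyword arguments so that the e-value threshold can be moved without touching the rest of the check.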
Let's do it now. done.
for T0291, blast easily find 1JPA for T0291.
1JPA resolution is 1.91, evalue: -134, ind=0.74, cover ratio= 0.87 
(actually, the residues of not covered area are not evaluated because they are disordered (coordinates
are missing).
for the model generated from blast alignment, we got score 88, better than 78 of combination, 
close to the best 91. For the model and alignments: see /home/jianlinc/eval_casp7/easy
now add the easy_main.pl to the web server and test.
for T0290: evalue of blast is only -82, so no easiest model is generated. but this model is still
pretty good. but later psi-blast model is also pretty good.
for T0293: not significant templates found by blast
for T0295: found: e=-117, ratio: 0.99, resolution = 1.9
295 is an interesting example: frcom (score is 82) is better than the first model (cm, score is 74)
and psi-blast is using one template same as the one found by blast.
it is interesting to compare their scores. blast score is 75.64, psi-blast is 74.5.

so set final evalue to -100. 
------------------------------------------------------------------------------
T0333
very easy cm.
generate a cm model only using the top template. (running on mine4).
done.
both 3dpro and foldpro:
use easy model 1-4 to replace model 2-5. for the top, I still use the
multiple templates so far. 

for foldpro:
make one try:
the top 1 model (template resolution is 1.8)
the no 2 model (template resolution is 2.8)
so combination model (cm.pir) may not be as good as the top 1.
So decide to exchange model 1 (cm model) and model 2 (easy model 1). (leave 3dpro unchanged for comparison later).
and compare model 1 and model 2 (very similar):
Rmsd = 2.2Å Z-Score = 7.0
Sequence identity = 80.9%
Aligned/gap positions = 335/82

T0332: easy target
foldpro: replace model 3 with model 6. done.

-----------------------------------------------------------------------------------------------
	NEW IDEA
FOR VERY EASY TARGETS, WE DON'T NEED FOLDPRO. JUST USE CM. BUT WE ARE GOING TO GENERATE MORE
MODELS USING THE TEMPLATES IDENTIFIED BY PSI-BLAST. ONE COMBINATION, THEN MODELS USING
TOP RANKED TEMPLATES RESPECTIVELY.
(NEXT CASP).

FOR CASP7, WE USE MANUAL TWEAKING. GENERATE MODELS FOR TOP RANKED TEMPLATES.
APPLIED ON TARGETS SINCE T0332. 
	6/15/2006.


-----------------------------------------------------------------------------------------------
T0331
FOLDpro:
	exchange model 1 (frcom) and model 2 because model 1 has some knots. (to do)
3Dpro:
	use model 2, 3, 5 to replace model 1 and two other bad models. (to do)

T0330
resubmit model 1 due to error. done.
T0330TS137_1  PIN_336812_18259  1127-6715-8809  06/14/06 17:07:03 pfbaldi@ics.uci.edu 

--------------------------------------------------------------------------------------------


--------------------------------------------------------------------------------------------
Two post analysis problems:
(a) T0293:
	dangling region is not aligned. so the score is lower. we need to fill the short psi-blast alignment
	manually.
(b) multiple templates don't necessarily help improve accuracy.
	so we need to restrict the number of templates by changing the evalue of cm ( to -5)?
	to do????

**************************************************************************
		MIDDLE-TERM LESSON:
VERY VERY VERY IMPORTANT LESSONS:
In the future, IF WE SEE UNALIGNED REGIONS IN CM, WE NEED TO PICK 
SIGNIFICANT FR ALIGNMENTS TO FIX THE HOLE MANUALLY BY CHECKING
THE XXX.FR.PIR FILE. THEN USE THE SIMPLE COMBINATION TO COMBINE
IT WITH CM OR TAKE THE PORTION OF MISSING TO FIX THE HOLE.
40 RESIDUES HOLE CAN COST US 15 GDT-TS POINTS.

		T0293 IS THE LESSON.
ALSO, WE CAN SIMPLY TAKE THE TEMPLATES FOUND BY CM AND EXTEND IT (OR CHECK ITS
ALIGNMENT IN FR) TO FILL THE GAP.
IN ANY CASE, I DON'T ALLOW LONG DANGLING REGIONS.
**************************************************************************
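
Spotting such holes does not have to rely on reading the alignment by eye. A minimal Python sketch, assuming the CM alignment has already been reduced to a list of covered target ranges; uncovered_regions is a hypothetical helper, and the default of 25 residues is the dangling-region limit mentioned elsewhere in these notes.

```python
def uncovered_regions(length, aligned, min_len=25):
    """List target regions longer than min_len that no alignment covers.

    length:  target length in residues.
    aligned: list of (start, end) ranges on the target, 1-based inclusive.
    """
    covered = [False] * (length + 1)  # index 0 unused
    for start, end in aligned:
        for i in range(max(1, start), min(length, end) + 1):
            covered[i] = True
    holes, hole_start = [], None
    # Walk one position past the end so a trailing hole is closed too.
    for i in range(1, length + 2):
        if i <= length and not covered[i]:
            if hole_start is None:
                hole_start = i
        elif hole_start is not None:
            if i - hole_start > min_len:
                holes.append((hole_start, i - 1))
            hole_start = None
    return holes

# Made-up example: a 200-residue target whose first 40 residues are unaligned.
print(uncovered_regions(200, [(41, 200)]))
```

Each reported range is a candidate for filling from the fr alignments or by extending the CM template.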

-----------------------------------------------------------------------------------------------


T0330: easy target
3dpro: template 2GFH, two domains: 1-108, 108 to end.
1jud: two domains according to cath, 1 domain according to SCOP.
Surprisingly, the same templates (1jud and 2gfh are chosen for 
T0329 as well?)
------------------------------------------------------------------------------------------

T0329: easy target
foldpro:
	template 1JUD. scop classify it into one domain. cath 2 domains.
	visual inspection -> 2 domains.
This is a two domain protein. FR find a lot of templates. but the templates
on top seem to only match one domain of T0329. Thus predicted GDT-TS score 
is only half of cm model. This also reflect the problem of global alignment
in stx generation (for wrong domain architecture, alignment can't be correct).
but (in Edgar's Current Opinion Stx Bio, June/July, 2006, he mentioned a paper that can 
handle non-matched domain architecture in alignment. maybe we want to try that
method).

T0328: easy target
3Dpro: 
single domain according to visual inspection and PDB. The template of 
this protein is very new (released in May, 2006). 
model 3 is a bad model. but since we have a very good model on the top,
and models 6-8  are also very bad. 
final decision: replace model 3 with model 6.

------------------------------------------------------------------------------------------
T0327: hard target

3Dpro rank 1XMA top 1, foldpro rank it 2nd. 

considering model check score too, 
for foldpro, we need to rank 1XMA on the top (exchange model 1 and 2).


----------------------------------------------------------------------------------------------------
T0326: easy target
3Dpro:
template: 2GHR
visually inspect cm model, it is single domain protein. but domain parser 
cut it into two domains.
PDB also classify it into 1 domain protein. 

Decision:
3Dpro:
	exchange model 1 and 2 according to model check score and visual inspection (exchange cmfr and cm_ab).


-------------------------------------------------------------------------------------------------------
START FROM TARGET 323, WE ONLY COMPARE RESULTS OF FOLDPRO AND 3DPRO WITH 
OUT BACKUP. FOR OTHER SERVERS, WE ONLY CHECK THE EXISTENCE. 

--------------------------------------------------------------------------------------------------------
Lesson: (overcut on T0321 hard target?)
domain prediction for T0321 (hard target?). since it is a hard target (score = -0.2?),
the fr stx is not very compact and confident. domain parser overcut the protein into
two domains. In this case we need to consult dompro, meta domain, and pdb template
to make final decision

The lesson is that for the hard target, we also need to carefully check the domain prediction
and check if domain parser makes a reasonable cut.

another lesson learned from T0319 is the single domain limit should be set to 140.
for all the targets less than 140 residues, the domain number is set to 1 anyway. 
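
As a Python sketch, the length rule is a one-line override applied after domain prediction. The function name is hypothetical, and the rule is shown without any override for strong evidence.

```python
SINGLE_DOMAIN_LIMIT = 140  # residues, from the T0319 lesson above

def final_domain_count(target_length, predicted_count, limit=SINGLE_DOMAIN_LIMIT):
    """Force a single domain for targets shorter than the limit."""
    if target_length < limit:
        return 1
    return predicted_count

print(final_domain_count(120, 2))
print(final_domain_count(300, 2))
```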

--------------------------------------------------------------------------------------------------------
To do:
1. add model check to the online web server script to generate gdt-ts scores for each model.
done. 

2. disable the automatic sending of domain model of foldpro? (yes)
done. 

3. disable the automatic sending of 3d models of foldpro? (no)

Since next week, we are going to use model_check as the main tool and references to rank models
especially for models with negative scores. Visual inspection is used to remove apparently
bad, very loose models.

--------------------------------------------------------------------------------------------------------
T0325: hard target
3Dpro find two positive templates. the top 1 (1V6T) generate a tim barrel stx. According to SSpro,
the target does have intervening helix and strands. but the middle (some region), only
helices are predicted, there is no strand. Of course, there are loops in the middle.
so the ss evidence is not very obvious. According to SA, there are intervening
buried and exposed fragments. So it could be a TIM-barrel. But there is a knot generated
probably due to a large loop. Now regenerate models using fr1.pir (to see if the knot 
disappears).
there is still a knot. but I might use the new human model to replace the model 1. 
(need to compare them to make sure they are very similar). 
FINAL DECISION: leave 3Dpro as is.

FOLDpro:
put 1p49 and 1v6t as top 2. we need to let 1v6t as model 1 according to the consensus of 3dpro
and foldpro.
decision: exchange model 1 and 2. also regenerate domain model.

Apply model check on stx of 1v6t: score: 44 (3dpro), stx of 1p49, score is less than 20. 
for ab-initio, the score is 40 (so the models generated by ab-initio usually has high
scores, probably due to enforcement of SS. but it doesn't mean it has very high gdt scores.
it is a bias. We can train a model to predict gdt-ts score for ab-initio model only). 


T0324: easy target
foldpro:
two domains. domain 1(1-15,82-end), domain 2(16-81). 
3dpro:
	model 2 (frcom) is bad,  models 3 and 4 are bad. 
	leave as is.
This target is one example that fr yields different second domain as cm. fr predict
two aba domains. cm predict one aba, one helix. according to sspro secondary stx prediction,
cm is right.

for two-domain protein, this problem can happen on fr because fr is using global alignment.
if only one of two domains match, it will use two domains anyway.


T0323: easy target.
3Dpro: 
template: domain 2 of 1MPG (SCOP is single domain), CATH classifies it as two orthogonal bundles?
maybe it is really two domains: 1-150 (domain 1), remainder (domain 2) or (31-150, 1-30/151-end)
domain is really hard to determine for this protein. I prefer two domains currently.
FOLDpro:
replace model 4 with model 6. for domain (the last dangling region is classified to 
domain 1 again: 1 - 2 - 1 -2). change it to 1-2-1 format. (this is a problem of 
domain parser, we need to modify script to handle this).

T0322: easy target.
3Dpro:
cm_ab (model 2) looks better than cmfr. should we exchange them??? the first 18 residues of cmfr
is not well predicted (a stick and cause three clashes). 
compare model 1 and model 2:
	rmsd=0.9A, z=6.7, ind=100%, aligned/gap = 139. 
decide: exchange model 1 and model 2. 


T0321: hard target
leave as is.


T0320: easy target
foldpro: 1sur can cover the first two-thirds of the target,
1zun: can cover the first two-thirds of the target. (the last part of 1zun has no structural info)
so we need combination from other source to fill this protein gap. The covered part 
is single domain. 
submit the second part to foldpro: public. done. (maybe some positive will come out)
This protein may be a two-domain protein.

according to cmfr model of 3dpro: domain 1: 1-230, domain 2: remainder.

according to foldpro-public: the second domain is an ab-initio domain.

-------------------------------------------------------

T0319: hard target
DOMpro predict as two domains even though the protein is small, but I will stick to it.
to do: use model check to evaluate and rank models.
FOLDpro: 
use model 5 as model 1 (according to model check score and visual inspection)
use model 1 as model 2 
use model 6 as model 5
done.
Change mind. It is more like a single domain protein. done. (note: meta server also
predict single domain)

T0318:
large protein. from 80 to end has a match (multi-domain)
template 1LAM: two domains. 
scop: domain 1 (1-159): Macro domain-like, domain 2(160-484): Zn-dependent exopeptidases
cath: domain 1: aba, Leucine Aminopeptidase, domain 2: aba, Amino peptidase, zn.

human prediction:
domain 1: 1-166 (the first 57 residues are not aligned)
domain 2: 167-end (aba domain)

Human: 
FOLDpro: two significant templates: 2EWBA and 1LAMA, the first 31 residues are 
not used. so we can manually to add these residues to cm alignment to generate
a more complete stx. Running................
the new stx is good. the domain 1 is improved. However, there is a knot. we
need to adjust it.  now, make a slight adjustment to the alignment file, and 
regenerate a human2. now human2 doesn't have a knot.

3Dpro: cm model has the domain 1 mostly correct. but one strand is missing 
from the beta-sheet. the alignment of one template is different from FOLDpro
that is why the stx of domain 1 is somewhat different.

compare the human model (human 1) of foldpro and cm model of 3dpro:
rmsd=2.9, z=7.8, ind=85, aligned/gap = 450/51.
compare the cm model of foldpro and cm model of 3dpro:
rmsd=2.7 z=7.8, ind = 91.7, aligned/gap = 444/38
Looks like cm model of foldpro is more close to cm model of 3dpro.
Should we use human prediction of foldpro.
compare human model (human 2) of foldpro and cm model of 3dpro:
rmsd=2.5, z=7.7, ind=86, aligned/gap=450/49.
compare human 1 and human 2 of foldpro:
rmsd=1.5A, z=8.2, ind=100, aligned/gap=483. 
compare human 2 of foldpro with cm of foldpro:
rmsd=1.4, z=8.0, ind=99.8, aligned = 430/10 (the first 50 residues are not aligned).
so the quality of human model is better than cm model of foldpro. 


3dpro human: 
try to only use one template in 3dpro (the longest one) to regenerate 
a model to see. (but its resolution is 2.5 lower than other templates (1.5)).
compare 3dpro human (single template) with cm of 3dpro:
rmsd=3.3, z=7.1, ind=74.3, aligned/gap=439/76. its quality is not as good
as the cm model.

Decision
FOLDpro: use human 2 as the model 1. cm model can be used to replace other bad models.

3Dpro: cm model as model 1, the human model can be used to replace other bad models.

some fr models are pretty good. (DALI alignment, not CE alignment as above)
FOLDpro: fr based on 1LAM (model 4) and model 5 have better domain 1.
compare human1 of foldpro with model 4
domain 2: from 167 to the end is completely aligned. 
domain 1: from 100 to 166, there is one residue shifting. from 15-100,
not well aligned, but there is still stx similarity. 
rmsd=1.8, aligned residues=462, z-score=57, ind=68. 
compare human 2 of foldpro with model 4
domain 2: aligned=323(almost completely aligned), rmsd=1.9, ind=97, aligned=323, z=57.
domain 1: aligned=138, rmsd=2.4, ind=4, z=18.8 (there are some shifting)
compare human 2 with model 5 of foldpro:
domain 2: 170-end, z=52, aligned=324, rmsd=1.4, ind=94
domain 1: aligned=90, rmsd=3.3, ind=6, z=6.4.
compare human 2 with model 1 of foldpro (cmfr):
from 50 to end is almost completely same: rmsd=2.0, z=62, aligned=449,ind=94,
the front end has some similarity.
compare human 2 with model 2 of foldpro (cm):
rmsd=2.1, ind=98, aligned=449,z=63,from 40 to end is completely aligned.
compare human 2 with model 3 of foldpro (frcom)
rmsd=2.7,z=54, aligned=438,ind=71 (from 180 to end is completely same),
first domain is not well aligned. frcom has a small knot (not very serious)
compare model 4 with model 5:
domain 2: from 165 to end: rmsd=1.9, ind=97, z=53
domain 1: z=8.1, aligned=120 (from 40-167), rmsd=3.0, ind=13. 

compare model 4 of 3dpro and model 4 of foldpro:
rmsd=1.4, z-score=59, aligned=470, ind=90.

compare model 4 of 3dpro with human model of foldpro:
second domain is completely same.

FINAL DECISION:
1. 3dpro: not changed.
2. use model 4 of 3dpro to replace the model 1 of foldpro. The key reason is
to improve the domain 1 since second domain is almost same. 

post analysis:
compare 3dpro model 1 with sp3:
rmsd=2.5, z=7.8, ind=76.7, aligned=447
compare foldpro model 1 with sp3:
rmsd=1.1, z=8.0, ind=77.2, aligned=464/26.
-------------------------------------------------------------------------
T0317: easy target
CDD templates: 1VHR, alpha-beta protein, single domain.

FOLDPRO:
	model 1 is excellent.
	model 2 (frcom) and model 3 are bad. but no other good models to replace. just leave it.

*****************************LINUX COMMAND****************
If you start a long-running task and forget to add the ampersand,
you can still swap that task into the background. 
Instead of pressing ctrl-C (to terminate the foreground task) and 
then restarting it in the background, just press ctrl-Z after the command starts, 
type bg, and press enter. You'll get your prompt back and be able to continue with other work. 
Use the fg command to bring a background task to the foreground.
**************************************************************


T0316: hard target in terms of domains
cm finds a match from 1-270, but no match for region from 271-440. 
to see if fr can find the second region. 
the template 1VL2 is about 421 residue long and has two domains. 
Should both domains be used for this target? Or only use the first
domain as psi-blast did? Looks like this protein is two-domain.
both domains of 1VL2 are alpha beta according to PDB. 
domain stx of 1VL2 is complex: 1-170/380-421 -> domain 1. 172-370->domain 2. 


according to SSpro, the first half is a alpha-beta domain, the second 
half is a completely beta-sheet domain. so, psi-blast is probably right
only the first domain of 1VL2 should be used. 

*****************************************************************************
		NEW TRICK TO HANDLE MULTIPLE DOMAINS
trick: submit domain 2 separately from foldpro. If some positive templates
are found, we will combine it with cm templates. done.
if it is ok, we will use ~/jianlinc/modeller.sh to generate stx from combined
alignments later.

results from FOLDPRO (public),no significant match, but 1CQAA can be used??
Rank    Name    Score
1       1CQAA   -0.52
2       2FHXA   -0.82
3       1O9YA   -0.82
4       1D7YA   -0.85
5       1WH0A   -0.86
6       1LFOA   -0.89
7       1G5UA   -0.9
8       1MK0A   -0.9
9       1ACFA   -0.94
10      1F2KA   -0.94
*****************************************************************************
3Dpro identify: 
	1B37A: two domains: two FAD/NAD-binding domains (polyamine oxidase)
	1RSG: 3 domains
	1VBK: two domains
	1S3E: two domains: FAD/NAD-binding domain, FAD-linked reductase C-terminal domain
	2BXR: large, more than 500 residues? 
	1C0P: two domains: nucleotide-binding domain (oxidase), FAD-linked reductase C-terminal
			domain.
	
	1Q15: two domains: adenine nucleotide hydrolase-like, Ntn hydrolase like
	1O5W: two domains: FAD/NAD-binding domain, FAD-linked reductases, c-terminal domain
	1RU8: two domains: both are adenine nucleotide alpha hydrolase-like
	
	according to FR: some classify them as FAD/NAD, some as adenine hydrolase, some as others.
	not consistent at all?
	
3Dpro cm:
	1VL2: two domains: adenine nucleotide alpha hydrolase-like, argininosuccinate synthetase
			only first domain is used.
	1gpm: three domains, only domain 2 (central) is used: adenine nucleotide alpha hydrolases-like
		(it is an aba architecture according to CATH).
foldpro cm:
	1k92: two domains: adenine nucleotide alpha hydrolase-like, argininosuccinate c-terminal domain
	1VL2
	foldpro cm and 3dpro cm are consistent.

	foldpro cmfr takes a portion of 1N4W for the second domain: (from 331 to 445 of 1N4W)
	unfortunately, this portion of 1N4W is not used successfully (probably dropped by modeller)
	so the second part is just a stick.
	try to regenerate cmfr to see what happens.

	frcom: is completely coils. 
	
Human intervention (foldpro: /var/preserve/web/htdocs/T0316)
	1. regenerate cmfr:
		the last 7 templates are removed by modeller.

	remove those templates that cause problem and regenerate. 
		that is cmfr_new.pir and new models are generated in cmfr_new
	a cmfr_new model is generated. 
		the second domain is not very good. but there is an anti-parallel beta-sheet.
		
	this model can be used. 

	2. make a human.pir and generate stx.


	model 1 and model 3 must be replaced
	domain prediction need to be revised.

Human intervention (3dpro: /var/preserve/web/htdocs/T0316)
	1. regenerate cmfr: ok. 
		the first domain is ok. but the second domain is very bad. 
	
	2. make a new alignment by combining cm and the second domain generated by FOLDpro(public)
	use simple_gap_comb.pl to combine them.
	human.pir,
	models are in ./human/. done.
	looks better. 
	
domain prediction: 1-245 (or 240), 246-end.

compare the current model 1 of foldpro and 3dpro (only domain 1 matters), 
rmsd=3.2, z=5.9, ind=56%, aligned/gap = 182/59. They are significantly matched.

decision:

use its own human as the model 1 of foldpro and 3dpro respectively

use cmfr_new of foldpro as the model 2 of foldpro. 

use human of foldpro as model 3 of 3dpro, use human of 3dpro as the model 3 of foldpro.

resubmit domain prediction for foldpro.
-----------------------------------------------------

T0315: easy target
cdd template: 1J6O, tim-barrel, single domain.
FOLDpro: 
model 5 need to be replaced by model 6 (to do)
3Dpro: ok. 


T0314: hard target
sspro: alpha-beta(a few)-alpha protein.
3dpro: prediction is ok. 
foldpro: also put 1ksh on the top.
foldpro: model 6,7 seems to be better than some of model 2-5.
maybe use model 6 and 7 to replace model 4 and 5.
and it is a single-domain protein.

verify3d for foldpro:
model 1 = 6.78
model 2 = 17.43
model 3 = 11.59
model 4 = 8.16
model 5 = 11.15
model 6 = 11.16
model 7 = 27.9

decision: use model 7 to replace model 4

-----------------------------------------------------------------------------
T0313: easy target
cm model is very good. frcom is ok.
fr1-3 are very bad (model 3-4).
*************************PROBLEMS***********************
 Why mostly the templates ranked 
on the top are not good when there are many significant match?
Why the best templates are usually (no. fr 3, 4, 5)? in this case
it is models 6,7,8? the top ranked templates are short? so it
only cover a short stretch. So we really need to have a model
selection method or we really need to consider alignment length
or we need to rerank significant match using psi-blast?
this is an issue we need to address in the future?

model 2 (frcom) has a lot of clashes.
This is an important case to research this problem. 
**********************************************************
decision:
foldpro: model6-8 to replace model 3-5. resubmit model 1 due to error.
all top 5 templates are single domains. so domain is ok. 

3dpro: model 1 needs to be resubmitted due to error. also model 3 and 5 are bad.
use model 7 to replace model 5. (no substitute for model 3 yet).

--------------------------------------------------------
T0312: hard target
mainly beta (sspro)
both 3dpro and foldpro find the same template. 
3dpro classify it as positive, foldpro classifies it as negative.
according to the appearance, 3dpro model looks better. 
maybe it generate better alignments???
need to compare the models 1 of both and decide which one is put on the top.

verify 3d score:
foldpro:
	model 1 = 27.6
	model 2 = 1.75
	model 3 = 1.75
	model 4 = 8.3
	model 5 = 38
	
3dpro:
	model 1 = 25
	model 2 = 15
	model 3 = 43 (ab)
	model 4 = 43 (ab)
	model 5 = 10
	model 6 = 30
	
compare model 1 of foldpro and model 1 of 3dpro:
rmsd=1.8, z=6.5, ind=88.6, aligned/gap = 132/6

decision: 
foldpro: use model 1 of 3dpro to replace model 3 of foldpro (to do). done.
3dpro: use model 6 to replace model 5 (to do). done.


---------------------------------------------------------------------------------------------
			AN IMPORTANT PAPER
An urgent task and a new idea:
It looks like that it is very important to have a post-modeling quality evaluation.
The quality of the model is not determined by templates, but the final generated
models whose quality are affected by alignments, template quality(3D), and so on.
For instance, some top ranked models are totally loose by visual inspection.
So it is very important to develop a model evaluation tool, especially for hard
targets.
So now I need to do this:

Features:
1. ss match scores (%)
2. sa match scores (%)
3. 8A average contact probability (>=5 separation)
4. 12A average contact probability (>=5 separation)
5. average contact order (measure complexity)
6. average contact number (measure compactness)
7. verify3d score
Other factors to consider:
a) svm score (to do). not used now because it is specific to FOLDpro.
b) other energy measures (e.g. ab-initio, Arlo's program)
c) other tools (procheck, prosaII)
d) may include other pairwise energy terms: Lennard-Jones, Skolnick, and so on.
Use these features to regress on the gdt-ts scores divided by 100.
Train on hard targets, easy targets, and combine to see how well it can improve
performance.
In this casp period, we only test it on hard targets. 
So I need to build a dataset from casp 6 hard targets. Hard targets are the 
targets with SVM score less than a threshold (e.g. 0).  
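
A minimal sketch of how features 1, 5 and 6 above could be computed from a model (Python). Assumptions, not taken from the notes: contacts are CA-CA pairs within 8A with sequence separation >= 5 (borrowed from features 3 and 4), "average contact order" means the mean sequence separation over those contacts, "average contact number" means contacts per residue, and all function names are hypothetical.

```python
import math


def ss_match_percent(predicted_ss, model_ss):
    """Feature 1: percent of positions where predicted and model SS strings agree."""
    if len(predicted_ss) != len(model_ss) or not predicted_ss:
        raise ValueError("SS strings must be non-empty and of equal length")
    same = sum(1 for p, m in zip(predicted_ss, model_ss) if p == m)
    return 100.0 * same / len(predicted_ss)


def contacts(ca_coords, cutoff=8.0, min_sep=5):
    """Residue pairs (i, j) with j - i >= min_sep whose CA atoms lie within cutoff (assumed definition)."""
    pairs = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def average_contact_order(pairs):
    """Feature 5: mean sequence separation over all contacts (0.0 if there are none)."""
    if not pairs:
        return 0.0
    return sum(j - i for i, j in pairs) / len(pairs)


def average_contact_number(pairs, n_residues):
    """Feature 6: average number of contacts per residue (each pair counts for both residues)."""
    return 2.0 * len(pairs) / n_residues
```

Loose, stick-like models should give a low contact number under this definition, which is the compactness signal feature 6 is meant to capture.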

If this system is ready, we can use this system + visual inspection + clash checking
to make final decision. I won't use clash feature in the quality evaluation. 
it is just used to exclude models with a lot of clashes.

And we only train our systems on the single domains of CASP6. A more practical way 
is to train the systems on the whole target. 
Finally we can release this software and server to users.

data generation:
we should use as many different models as possible. Ideally, we should use models
in CASP6/7. Now I just use models generated by FOLDPRO.

We need to report the following results:
1) square error
2) how does this evaluation improve ranking of each different methods (here, we evaluate on FOLDPRO models)

-------------------------------------------------------------------------------------------------------
T0311: easy target
3Dpro:
region: 10-65 has match. two ends are not matched.
FOLDpro:
no cm match is found.
This example shows that large NR doesn't mean higher psi-blast sensitivity.

both foldpro and 3dpro (fr) find a  lot of significant matches.
model 3 (frcom) of 3dpro is pretty good. It should be used as the first model of 3dpro
and also the first model of foldpro. the frcom (model 1) of foldpro has a long (30 residue)
unaligned region.

compare model 1 of 3dpro (cmfr) model with model 1 (frcom) of foldpro:
aligned regions: 15-71
rmsd=3.5, z=4.1, ind=68.6, aligned/gap=70/15 (the middle region of cm is aligned)

compare model 1 of foldpro and model 3 (frcom) of 3dpro,
rmsd=1.2, z=5.5, ind=100, aligned/gap=71/0 (from region 0 to 64)

compare model 1 and model 2 of foldpro:
rmsd=2.7, z=5.0, ind=85.9, aligned/gap=71.

decision to do: done.
foldpro: resubmit model 1-5 due to errors. use model 7 to replace model 3.

3dpro: use model 3 to replace model 1, use model 6 to replace model 3.
this is probably a case where frcom is better than cm because cm alignment is too short???

This target is also an interesting orthogonal helix bundle (not a simple helix bundle like T0283)

----------------------------------------------------------------------------------------------


T0310: easy target
cdd template: 1BXSA
SCOP: 1 domain
CATH: 2 domains.
visual inspection: 1-269/473-500: domain 1; 270-472: domain 2. both domains are aba type. betasheet
is parallel.

domains from cm models:
domain 1: 1-214, 361-end
domain 2: 215 - 369

FOLDpro basically made the right prediction of domains. However, the last dangling 
region is assigned to domain 2 instead of domain 1. This is a bug. Eventually it
should be corrected in the code. For now it needs to be corrected manually (to do).
current cutting is: 1 - 2 - 1 - 2 (the last 2 should be changed to 1 or set to -)
decision: set it to 1. to do.
Maybe the whole protruding region should be set to a third domain (but that would be too complex).
we still stick to two domains. Later need to check with casp final results.
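
A minimal sketch of the manual correction for a 1 - 2 - 1 - 2 cutting (Python). Assumptions: the domain prediction is available as one label per residue, and the fix is simply to relabel the last run of residues. The function names are hypothetical and this is not the FOLDpro code.

```python
from itertools import groupby


def segments(labels):
    """Turn per-residue domain labels into (start, end, domain) runs, 1-based and inclusive."""
    runs = []
    pos = 1
    for domain, group in groupby(labels):
        length = len(list(group))
        runs.append((pos, pos + length - 1, domain))
        pos += length
    return runs


def relabel_last_segment(labels, new_domain):
    """Return a copy of labels in which the final run is reassigned to new_domain."""
    runs = segments(labels)
    if not runs:
        return []
    start, end, _ = runs[-1]
    fixed = list(labels)
    fixed[start - 1:end] = [new_domain] * (end - start + 1)
    return fixed
```

For example, labels [1,1,1,2,2,2,1,1,2,2] give the runs 1-3 (domain 1), 4-6 (domain 2), 7-8 (domain 1), 9-10 (domain 2); relabeling the last run to 1 merges it with the preceding run, giving 7-10 as domain 1.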

compare model 1 of 3dpro (cmfr) and model 1 (cm) of foldpro using CE:
rmsd=4.1, z=6.9, ind=77.4, aligned/gap=376/48
---------------------------------------------------------------------------------------------

T0309: hard target
3Dpro: model 5 is very bad. should be replaced by model 7 (or 6). 
verify3d score: 
model 1 = 1.1
model 2 = 24
model 3 = 21
model 4 = 7
model 5 = 1.1
model 6 = 8
model 7 = 4.25
decision: exchange model 1 and model 2, use model 6 to replace model 5. done.


FOLDpro: model 4 and 5 are very bad. should be replaced by model 6 and 7. model 2 need to resubmit due to error.
verify3d scores:
model 1 = 5.34
model 2 = 7.26
model 3 = 12.13
model 4 = 1.30
model 5 = 1.59
model 6 = 4.25
model 7 = 6.85
done.

--------------------------------------------------------------------------------------
T0308: easy target
CDD templates:
1F6B (aba alpha-beta protein)
1MOZ
1EOS

-------------------------------------------------------------------------------
T0307: hard target.
3Dpro: 
model 1 = 21
model 2 = 17
model 3 = 29
model 4 = 4
model 5 = 17
model 6 = 37 (highest, higher than model 1)
model 7 = 47
decision: to do: use model 6 and model 7 to replace model 4 and model 5. 
models 6 and 7 are ab-initio models whose scores correlate with Verify3d pretty well.

******************************************************************************************************
Looks like we need to develop an algorithm that ranks models by verify3d score when no positive
templates are found. Let's say we build 100 models (from 100 templates) using foldpro
and then rank them using verify3d. 
next CASP, I will try to generate 200 models from 200 templates if we have enough computing power,
then use verify3d or ab-initio to rank these models and select the top 5 for hard targets. 
This is a very simple approach. Hopefully we can identify all FR/A templates.
********************************************************************************************************
COMMENTS:
	verify3d may not distinguish good models from less good models, but it can identify very bad 
	models. We need to avoid putting very bad models (score < 10) on the top. If another model has
	a substantially higher score, it should replace the very bad model.
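
A minimal sketch of this replacement rule (Python). Assumptions, not from the notes: "substantially higher" is taken as at least 5 verify3d points higher, the worst top-5 model is replaced first, and reserve models (6, 7, ...) are tried best-first. The function name and the margin are hypothetical.

```python
def replace_bad_models(scores, top_n=5, bad_cutoff=10.0, margin=5.0):
    """scores[k] is the verify3d score of model k+1 in the current order.
    Returns the model numbers to submit as models 1..top_n."""
    top = list(range(1, min(top_n, len(scores)) + 1))
    # Reserve models beyond the top list, best verify3d score first.
    reserves = sorted(range(top_n + 1, len(scores) + 1),
                      key=lambda m: scores[m - 1], reverse=True)
    # Slots holding very bad models, worst score first (ties keep rank order).
    bad_slots = sorted((i for i, m in enumerate(top) if scores[m - 1] < bad_cutoff),
                       key=lambda i: scores[top[i] - 1])
    for slot in bad_slots:
        if not reserves:
            break
        candidate = reserves[0]
        # Replace only if the reserve is substantially better (assumed margin).
        if scores[candidate - 1] >= scores[top[slot] - 1] + margin:
            top[slot] = candidate
            reserves.pop(0)
    return top
```

With the T0306 3Dpro scores below (1.1, 1.62, 7.5, 1.75, 1.1, 14.28, 10.02) this gives 6, 2, 3, 4, 7, i.e. model 6 replaces model 1 and model 7 replaces model 5. It is only a filter against very bad models, not a full ranking method.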

-----------------------------------------------------------------------------------------
T0306: very hard target
CDD no.
AB-INITIO is not used because it was not predicted yet. 

3Dpro: model 1 is trash. model 2 is better. model 5 is very bad.
model 6 and 7 are better. 
Decision: use model 6 to replace model 1 and model 7 to replace model 5. done.
Model 1 and 2 are rejected due to errors. so model 2 need to be resubmited as well.
Use verify3d to evaluate models 1 to 7 of 3Dpro:
model 1 score = 1.1
model 2 = 1.62
model 3 = 7.5
model 4 = 1.75
model 5 = 1.1
model 6 = 14.28
model 7 = 10.02
model 2 error is due to the template is too new.
resubmit model 1 and 5 now.
Model 2 can't be submitted (too new template). replace model 2 with the model 1 of 3Dpro. done.
use one abinitio model to replace model 4. done. (ab-initio score is 26)

There is one local complexity region in template 1NAYA
Should we remove it??????????????????????? Hold on here. if it often appears on 
the top, it should be removed.

FOLDpro
model 1 score = 5
model 2  = 3
model 3 = 8
model 4 = 10
model 5 = 17
actions: FOLDPRO: exchange model 5 and model 1 because model 5 is much better.
use model 5 of FOLDPRO to replace model 2 of 3Dpro. done.

----------------------------------------------------------------------------------------------------
T0305: easy target
The prediction of cm and fr is not very consistent.
cm: largely an a+b domain. according to pdb, templates are consistent (largely ab domain)
fr: largely aba domain.

according to PDB, it is single domain protein. 
FOLDPRO finds a lot of positive templates, but the best templates are not ranked on top and global
alignment could cause problems as well.
Stick to cm models.

for FOLDpro, pdb overcut the domains. alpha+beta is cut into one alpha and one beta.
We need to correct it.
--------------------------------------------------------------------------------------------------
T0304: hard target (3Dpro, ab-initio model is put on the first)
3Dpro: model evaluation using verify3d
	model 1= 51 (N/A model)
	model 2 = 35
	model 3 = 46
	model 4 = 15
	model 5 = 5.8

FOLDpro: 
	model 1 = 20
	model 2 = 24
	model 3 = 24
	model 4 = 11
	model 5 = 12
	
---------------------------------------------------------------------------------------------------
T0303: easy target
Problem:
	the topology of cm model and fr model is not consistent. 
	cm model has two domains: alpha/beta + alpha helix
	fr model has two domains: both are alpha/beta.
Check cm models, all templates have consistent structure.
CATH consistently classifies them into two domains: alpha/beta(aba) and mainly alpha.
This is also consistent with the PDB classification most of the time. SCOP sometimes classifies some
templates into one domain, sometimes into two domains. 
So we stick to the CATH classification. 
SSpro prediction matches very well with the secondary stx of template 2GFH. 
need to resubmit model due to chain error. 

The reason frcom goes wrong is that the fold recognition is dominated by one domain.
Many templates that match only one domain of the target are selected. the non-matched (aba) domains
of the templates are then used to model one domain of the target. This causes the problem. 
This problem is due to two reasons: a) lobster global alignment b) we use full chains instead 
of individual domains to build the template library. 

This example shows that the cm model with local blast alignment is better than global alignment 
in this case. 
------------------------------------------------------------------------------------------------
A new FR/A fold recognition approach:

Protein fold recognition using pairing, connection, relative position of secondary structure
elements (especially designed for hard targets that can't be identified by sequence approaches)
	1. beta-strand pairing
	2. separation between secondary structures
	3. super secondary structure patterns
	4. secondary structure types, lengths, orders
	This information can filter out most proteins. 
	Then align secondary structure together and use energy to select final models.
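
A minimal sketch of item 4 only (secondary structure types, lengths, orders) used as a pre-filter (Python). Assumptions: SS is given as a 3-state string (H, E, C), elements shorter than 3 residues are ignored, and a template passes the filter if it has the same order of helices and strands as the target. Strand pairing and separation (items 1-3) are not covered. Function names are hypothetical.

```python
from itertools import groupby


def sse_elements(ss, min_len=3):
    """Collapse an SS string into a list of (type, length) for helices (H) and strands (E)."""
    elements = []
    for state, group in groupby(ss):
        length = len(list(group))
        if state in ("H", "E") and length >= min_len:
            elements.append((state, length))
    return elements


def sse_order(ss, min_len=3):
    """Order of secondary structure elements as a string, e.g. 'HEE'."""
    return "".join(state for state, _ in sse_elements(ss, min_len))


def same_sse_order(target_ss, template_ss, min_len=3):
    """Crude filter: keep a template only if its element order matches the target's."""
    return sse_order(target_ss, min_len) == sse_order(template_ss, min_len)
```

An exact match on element order is probably too strict for real predictions (SSpro errors, inserted elements); an alignment of the element strings with gap penalties would be the natural next step.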
------------------------------------------------------------------------------------------------
May 25
T0302: easy target
CDD templates: 1EMU, or 1AGR
probably it is an all-alpha-helix single-domain
protein. 


T0301: hard target
This protein has a lot of beta-sheets and some helix.
good news.
FOLDpro find two significant matches:
1W61 (res = 2.1A): Proline Racemase, 350 residues, single domain.
1TM0A (res = 2.8A): putative proline racemase, 350 residues, one domain. 1TM0A is an alpha+beta protein.
It turns out that two templates are very similar. 
Using CE alignment, rmsd=1.9A, Z=6.8, Seq ind=29.3%, aligned/gap = 317/42
in terms of size, T0301 is more close to 1W61.

Decision to make:
should we use fr1 as model 1 or use frcom as model 1.
frcom has too many clashes. 
we need to replace frcom using fr1.pdb.
3Dpro decision:
	use model 2 as model 1, use fr1.pdb as model 2.
done.

decision 2: 
FOLDpro also has the same problem. frcom has a lot of clashes.
so replace FOLDpro model 1 with the model 1 of 3Dpro. 

Also Xu's domain parser prediction:
Getting file pdb1w61.ent.Z from PDB...
2 domains have been found for 1w61a:
Domain 1: 186-366   (conf:>73.1)
Domain 2: 42-185;367-394   (conf:>45.8)


PDP also predicts two domains. 
So we need to predict as two domains.
resubmit domain model as 2 domains.

This target also has another interesting implication:
frcom only combine two templates whose structure are very similar according to CE.
but the combined model is not consistent, probably due to alignment inconsistency.
So even two structures are similar, if the alignment is not right, the combined
models are still wrong. Usually the wrongly combined model has a lot of clashes. 

--------------------------------------------------------------------------------
May 24
T0300
hard target.
An interesting, short protein.
DISpro predicts a lot of disordered regions
BETApro predicts only a number of pairs, but there are some long range
between 35s <-> 95s.
SSpro: three helix protein (helix is pretty long). there is one small
strand at the end. For this protein, we may want to use ABIpro as well.
Human prediction: helix-loop-helix-loop-helix structure.
----------------------------------------------------------------------------------
T0299: hard target.
SSpro: an alpha/beta protein
a lot of beta-residue contacts according to contact map predictor.
very little disorder.
ACCpro: beta-strands are buried.
dompro predict it as a single domain protein. 
meta domain predict it as single domain.

betapro: predict it has a mixed parallel and anti-parallel beta-sheet
consisting of about 9-10 strands.  So I predict it has a mixed
beta-sheet core and is a single domain protein. Use beta-strand 
pairing, we can predict the stx of this protein.
Human prediction: the protein is a 3-layer sandwich. the core is
a mixed or parallel beta-sheet surrounded by alpha-helices on both sides.

The model 1 of FOLDPRO doesn't look good. But I have no idea about how to 
handle this? should we use ab-initio, or should we change model order? and how???
---------------------------------------------------------------------------------
May 23, 2006
Boris said: it might be better to use a smaller, more representative library to
remove false positives. I think it might be useful to retrieve a lot of 
redundant proteins from the same family. Think about this. Since we have two-level
modeling, a smaller FR library might be ok.
-------------------------------------------------------------------------------
T0298: easy target. (template 1GL3). 1GL3 is a two domain protein. domain 1
is three-layer a/b sandwich, domain 2 is two-layer a/b. (1-135, 136-end?)

visually inspect the cm model, it has two domains: 
domain 1: (1-130 & 320-end): alpha/beta/alpha sandwich
domain 2: 131-319: alpha + beta.
(very similar to the PDP parsing. great!!!!!!!!!!!!!!!!!!!!!!!!)
comparing the stx generated by 3dpro and foldpro, rmsd=1.1, z=7.8
ind=99.4, aligned/gap = 329/4

Like T0296, this target also has a lot of helices and strands. but the intervening helices (between
strands) are not regular. the distance between strands is not regular. so it doesn't 
form a beta-barrel. (this may be a method to distinguish barrels from non-barrels.)
---------------------------------------------------------------------------------
T0297: easy target, single domain

-----------------------------------------------------------------------------------
SO FAR, T0283, 285, 287 ARE HARD TARGETS, SERVERS DON'T AGREE WITH EACH OTHER.

-------------------------------------------------------------------------------------
NEW IDEA:
MULTIPLE TEMPLATE COMBINATION HAS PROBLEM OF HANDLING INCONSISTENCY IN TEMPLATES.
ANOTHER IDEA TO IMPROVE MODEL GENERATION / SELECTION IS:

A) USE ALL SIGNIFICANT TEMPLATES TO GENERATE A MODEL
B) CLUSTER THE GENERATED MODELS AND SELECT THE BIGGEST CLUSTER
C) SELECT A MODEL CLOSEST TO THE CENTER AS THE FINAL MODEL????????????
OR USING WEIGHTED AVERAGE TO CREATE A MODEL?
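
A minimal sketch of steps B and C (Python), assuming a symmetric pairwise RMSD matrix between the generated models is already available (e.g. from CE or any superposition tool). The clustering here is a simple neighbor count within an RMSD cutoff, which is an assumption and only one of several possible clustering schemes. The function name and cutoff are hypothetical.

```python
def pick_cluster_center(rmsd, cutoff):
    """rmsd[i][j] is the RMSD between model i and model j (0-based, symmetric).
    Returns (members of the biggest cluster, index of its most central model)."""
    n = len(rmsd)
    if n == 0:
        raise ValueError("need at least one model")
    # Neighbors of each model: all models within the RMSD cutoff (including itself).
    neighbours = [[j for j in range(n) if rmsd[i][j] <= cutoff] for i in range(n)]
    # Biggest cluster = neighborhood of the model with the most neighbors.
    seed = max(range(n), key=lambda i: len(neighbours[i]))
    cluster = neighbours[seed]
    # Center = cluster member with the smallest total RMSD to the other members.
    center = min(cluster, key=lambda i: sum(rmsd[i][j] for j in cluster))
    return cluster, center
```

The weighted-average alternative would need superposed coordinates rather than just the RMSD matrix, so it is not sketched here.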


----------------------------------------------------------------------------------------------
T0296: hard target. 
Large protein. find a match, but uncharacterized. also no stx found using CDD.
DOMpro: domain 1: 1-241, domain 2: 242-445
According to SSpro and ACCpro, the target has alternating helix and strand.
And strands are usually buried. So this protein must be an alpha+beta protein, where
the outside helices surround the core beta-sheets. the key issue is it is a two
domain or one-domain protein?

Hard targets. both 3Dpro and FOLDpro can't find positive templates.
Use model 1 of 3Dpro to replace model 3 of FOLDpro.

DOMAIN PREDICTION NEED TO BE VERY CAREFUL:
SO FAR META-DOM (INTERPRO, DOMSSEA, SSEP-DOMAIN) PREDICT ONE DOMAIN.

Human prediction: 
Analyzing the beta-strand pairings of T0296, this protein might have
12 or 13 strands which are paired in parallel. It might be a beta-barrel.
So we need to change domain prediction to 1. 

FINAL DECISION:
for 3Dpro, use model 1 of
FOLDpro to replace model 4 of 3Dpro. 

If the protein is not a beta-barrel, it must have a very large beta-sheet inside
that is packed by alpha helices.
I need to develop a tool to generate tertiary structure according to secondary 
stx elements and beta-sheet topology (strand pairings), or even turns and helix orientation.
One naive approach is to convert this topology into contacts, then use contact reconstruction

another approach is to directly set topology and fit in the stx.
I need to learn how to write program to set the backbone. 
Check if there some code existing or our own contact map reconstruction code.
This is a kind of tertiary programming. 

Final decision: 
I manually predict T0296 as a beta-barrel. So it is a single-domain protein. but there may be
a small dangling region (about 40 residues) at the C-terminus. But it may also be a 
disordered region. DISpro predicts the last 30 residues as disordered. BETApro_contact and SVMcon also
predict contacts between residues around 20 and residues around 420. So it may 
really be a large beta-barrel.
But BETApro doesn't predict the pairing between strand 1 and strand (11/12/13). That means
for the prediction of a beta-barrel (the closing pair), a contact map predictor may help. ---
consider developing a beta-barrel predictor using both strand pairing and contact.


This decision is difficult considering the size of the protein. 
The largest single domain protein in CASP6 is T0203 which has 365 residues.


Post analysis:
this is really a hard target. the predictions from all groups are not consistent.
SP3 seems to predict a good structure which consists of two similar domains.
Each domain is an alpha-beta-alpha. beta is a parallel beta sheet. 
If this is correct, then the protein should have two domains. the boundary is
in the middle. All domain predictors except for Robetta and another group predict
as single domain. DOMpro predict as two domains and boundary is in the middle.
so is DOMpro and Sp3 right? Did I make a mistake to change DOMpro prediction 
for a large protein (400 residues?)
-----------------------------------------------------------------------------------------------
			**********************************************************
			DECISION:
			USE 3DPRO MODEL 1 TO REPLACE FOLDPRO MODEL 1 (
			because we only want to use one template for this case)
			)
			before doing that, we need to copy models and 
			recompare model 1 to make sure copying is right.
			the Model 1 of FOLDpro is used as model 2 of foldpro and 3dpro.
			slightly adjust domain prediction domain 1: 1-178, domain 2:179-end
			**********************************************************
T0295
CDD blast: Dimethyladenosine transferase (rRNA methylation) (no pdb hit according to cDD)

easy target, find a number of very significant matches.

There is a strange issue here:
FOLDpro use two templates
3Dpro uses one template. Due to the difference between e-values (24), both should use only one
template, why foldpro uses two????
Reason is unknown. Be aware of this problem.
Now I compare the two models of 3dpro and foldpro using CE:
we get RMSD: 1.2, z-score = 7.7, seq ind=100%, aligned/gap = 275/0.
So two models are very very close, but not identical.
it is a two domain proteins: 1-179: domain 1, 180-275: domain 2.

for this kind of clear domains, dompro predict 1-161: domain1, 182-275: domain 2.


----------------------------------------------------------------------------------------------
	**********************************************************************
	*********************WEEK TWO: TWO LESSONS****************************
		******BE VERY CAREFUL ABOUT TWO DOMAIN COMBINATION******
TWO LESSONS SO FAR:
a)T0289: TWO DOMAINS, WE SHOULD USE FOLDPRO HUMAN FOR 3D PRO AS WELL
b)T0285: HARD TARGET. 3DPRO PREDICTION LOOKS BETTER AND VERIFY3D SCORE IS 26 TWICE AS MUCH AS 
T0289. I SHOULD HAVE USED IT FOR FOLDPRO AS WELL. AT LEAST, THAT MODEL SHOULD BE PUT INTO THE TOP 5
	LIST OF FOLDPRO.
c)T0293:	I found model 6 of 3Dpro T0293 is very good. it uses template 1xj5. 
	this model should at least be included, (some time maybe model 1).
	it only uses one template.
	so we need to visually inspect more models and change the order if a model is
	visually better and its svm-score is significant and close. 
	we should also remove apparently bad model in the top 5 and substitute them by
	model 6, 7, 8, and so on. 

d)lessons: when both cm and cmfr exist, sometime we want to add good template based on
single template, cm can be removed because it should be very similar to cmfr. 

e) T0293, should include model 6. should not use too many combination models.
Also the first fragment (1-60) should be consider a domain. domain is based 
on the stx, not based on the distance between two fragments.

f) when intervening in domain prediction, we must be very careful. according to the domain parser,
T0293 should have two domains, but I wrongly classified it as 1 domain manually.
the first fragment is long enough to be considered a domain.
The key issue: when do we consider a fragment just a tail end, and when do we consider it an 
independent domain? look at the structure and also the length of the fragment. (length >=35-40?)


LESSONS: WHEN FIND A GOOD STX FOR EITHER FOLDPRO OR 3DPRO, SHOULD BE USED FOR ANOTHER
DON'T BE AFRAID.  AT LEAST, IT SHOULD BE PUT INTO TOP 5 LIST.
			5/20/2006
	***********************************************************************
	***********************************************************************
----------------------------------------------------------------------------------------------
T0294: easy target: (we are similar to the best)
domain 1: 1-100: 1NAC (membrane ion channel-forming peptide), replaced by 1NRU????
domain 2: 101 - 328: 1DXY and 1GDH. 1DXY it self has two domains according to SCOP and CATH.
	1GDH also has two domains. 
So the total number of domains is 3. 

According to tertiary stx prediction, there are two domains containing non-continuous segments.
3Dpro: 1-104 and 296-328: domain 1, 105-295: domain 2.

post analysis
compare model against sp3 using CE
jigsaw: rmsd=3.0, z=7.6, ind=97, aligned=311
3dpro: rmsd=2.7, z=7.5, ind=91, aligned=307
foldpro: rmsd=3.2, z=7.4, ind=94, aligned=311
hhpred: rmsd=2.8, z=7.4, ind=94, aligned=312
metatasser: rmsd=3.4, z=6.0, ind=81, aligned=298
mGen3D: rmsd=2.2, z=7.5, ind=94, aligned=298
Raptor: rmsd=2.5, z=7.5, ind=94, aligned=307
zhang: z=7.6
Sam: z=3.3: totally a failure???????????????

-------------------------------------------------------------------
T0293: (we are worse than the best.alignment is too short.........................)

	?????????????????????????????????????????????????????????????????????????????
QUESTION:
	WHY DO MOST GROUPS FIND 1T43 AS THE TEMPLATE?
	DOES EVERYBODY USE BLAST TO SEARCH FIRST? (INSTEAD OF USING PSI-BLAST)??????????
	LOOKS LIKE 1T43 IS THE BEST TEMPLATE. IT COVER THE WHOLE SEQUENCE............
	??????????????????????????????????????????????????????????????????????????????

Post analysis 5/22/06
Domain prediction seems to be ok. most group predict it as one domain except
that Baker's group consider a small chunk of fusion domain. so two ends of
dangling regions don't need to be considered a domain.
use CE to compare stx
foldpro-sp3: rmsd=2.3, z=6.0, seq ind=55, aligned/gap=177/64.
jigsaw-sp3: rmsd=3.0, z=6.1, ind=68, aligned/gap=196/71
hhpred-sp3, rmsd=1.6, z=6.7, ind=83, aligned/gap=206/50
karypis-sp3, rmsd=1.8, z=6.9, ind=90, aligned/gap=214/37
nfold-sp3, rmsd=0.9, z=6.8, ind=87, aligned/gap=199/28
pcons-sp3, rmsd=2.3, z=6.7, ind=93, aligned=215
raptor-sp3, rmsd=2.6, z=7.2, ind=89, aligned=239
sam-sp3, rmsd=4.3, ind=67, aligned=103.
zhang-sp3, rmsd=2.7, z=6.7, ind=86, aligned/gap=223.
SO OURS IS ONLY CLOSER TO SPARKS3 THAN SAM-T06.
We didn't generate the best alignment with the template.
Our alignment is simply too short.


Post-thinking on 5/21/2006.
It really looks like this protein has actually three fragments. 
one big fragment (51-210), another two fragments 1-50 and 211-250.
The key issue:
Are the two fragments at both ends domains?
the fragment 1 may be a zinc finger domain?
3Dpro model 6 is so good that it should be included. 
According to this one, the fragment 1 should be a domain.
the last fragment is just a short stretch that can't be considered a
domain.
comparing model 1 and model 1(human), the core regions (61-200)
are aligned well. but the first domain is not. We should 
include at least model 1 which uses the third template 1xj5. 
Final judgement: the protein should have two domains, not 
just one.
So HUMAN DOMAIN PREDICTION: 1-50, 51-250 (THE LAST END IS ALSO A LINKER)
According to this new domain definition, compare the 1-60 fragment of
model 1 and model 6 using CE: rmsd=4.0, z=2.3, seq ind=3.1, aligned=32.
so it is not well aligned. Basically, the similarity of domain 1 is
very low. LESSON: SHOULD NOT USE TOO MANY COMBINATION. SHOULD NOT LET
BIG DOMAIN COMPLETELY DOMINATE THE SMALL DOMAIN. 
MORE IMPORTANTLY, IN THIS CASE, OUR AUTOMATIC METHOD INDEED CLASSIFIES THE
PROTEIN INTO TWO DOMAINS (1-50, 51-END). domain 1 looks like a zinc 
finger. 


combination is hard???? is this due to PSI-BLAST ALIGNMENT (TOO SHORT)
OR DUE TO NEEDING TO TAKE FRAGMENTS FROM MULTIPLE TEMPLATES?

5/20/2006:
	I found model 6 of 3Dpro T0293 is very good. it uses template 1xj5. 
	this model should at least be included, (some time maybe model 1).
	it only uses one template.
	so we need to visually inspect more models and change the order if a model is
	visually better and its svm-score is significant and close. 


decision: FOLdpro: use human model 1 as model 1 and change domain to 1 domain.
decision: 3Dpro: use human model as model 1
3Dpro comparison:
	human <-> model 1: z=16.3, rmsd=4.2, ind = 91.2%, aligned/gap = 176
	human <-> model 2: z=16.3, rmsd=4.9, ind=87, aligned = 193
	human <-> model 3: z=8.4, rmsd =3.0, ind = 53, aligned = 144
	human <-> model 4: z=7.1, rmsd = 3.0, aligned = 133, ind = 47.
	human <-> model 5: z=8.9, rmsd=9.6, ind = 39, aligned = 150.
	
T0293: hard target in the sense of combination.
not very hard. 3Dpro finds some templates, particularly for the central parts. the two ends
are not very well matched. 

visual inspection:
the second domain is not well predicted.

1ORI can cover 54-235 (including second domain), but its e-value is only e-18, so only small fragments
are included. We need to use it to predict a more coherent domain. 
We need to generate a human prediction for this (at least for FOLDpro).

Human in FOLDpro:
use cm_main_comb_join.pl cm_opt fasta file, output file to redo by setting
e-value threshold to -17, so 1ORI can be used.

Another issue: the 2B3T alignment in FOLDpro is too short to be used; in 3Dpro it covers the front end of 
the target. 

Unfortunately, when we add a lot of alignments, Modeller fails to generate a stx. A lot of alignments
must be removed. On FOLDpro, it finally generates a stx, but not as good as 3Dpro (cm.pdb).

Currently, the final 40 residues of the 3Dpro model are not well wrapped. The first 33 residues are slightly
better predicted. 

				*************To DO******************
FINAL DECISION:
take the first model (cm.pdb or cmfr.pdb) from 3Dpro and use it as the first model of FOLDpro.
				*************TO DO******************

***********VERY VERY VERY IMPORTANT:
Fortunately, 3Dpro generates an excellent stx with only the last 30 residues not well predicted.
We must use this stx for both 3Dpro and FOLDpro. Also, I can take 30-residue fragments from other
proteins. Now I add one fragment from 1OR8A, which is also found by psi-blast, to generate a 
stx. But I can't add the whole 1OR8 because it will cause Modeller to crash. 

Compare this stx with the top 1 stx of 3Dpro and FOLDpro. If similar this stx should be the 
stx submitted as model 1 for both 3Dpro and FOLDpro.

the alignment file is: T0293.human.pir, pdb file is: T0293.human.pdb
in mine 3: /var/preserve/prosys/web/cgi-bin/work/114796762319404-3d/human/out/out

DOMAIN ASSIGNMENT: 1 domain.

************************************IMPORTANT PREDICTION*********************************************
For this protein (visually inspecting cm.pdb of 3Dpro), I predict it is a single-domain protein. 
Residues 145 to 250 also form a beta-sheet (four strands) and two helices that should be joined 
with the residue 1-145 domain. Unfortunately, I can't adjust the stx manually to make 
them integrate together. 
*******************************************************************************************************


--------------------------------------------------------------------
T0292: an easy target: Serine/Threonine protein kinases, catalytic domain. Phosphotransferases of the serine or threonine-specific kinase subfamily. 
1JNK: scop classified it into one domain, CATH classifies it into two domains. pdb classifies it
into one domain. 
DOMpro: predict single domain
FOLDpro: classify into two domains (1-86, 87-end).

According to visual inspection, it could be one or two domains.

Domain prediction is ambiguous. So just stick to FOLDpro. 

----------------------------------------------------------------------
T0291
easy target
CDD classification and function:
Tyrosine kinase, catalytic domain. Phosphotransferases; tyrosine-specific kinase subfamily. 
Enzymes with TyrKc domains belong to an extensive family of proteins which share a conserved 
catalytic core common to both serine/threonine and tyrosine protein kinases. Enzymatic activity of 
tyrosine protein kinases is controlled by phosphorylation of specific tyrosine residues in the activation segment of the catalytic domain or a C-terminal tyrosine (tail) residue with reversible conformational changes.
1FGI: two domains (two protein-kinase like folds according to scop), cath also two domains.
1IR3: one protein kinase domain according to scop. cath: two domains.
DOMpro predict as two domains
------------------------------------------------------------------------
T0290: easy target
cyclophilin_ABH_like: Cyclophilin A, B and H-like cyclophilin-type peptidylprolyl cis-trans isomerase (PPIase) domain. This family represents the archetypal cytosolic cyclophilin similar to human cyclophilins A, B and H. PPIase is an enzyme which accelerates protein folding by catalyzing the cis-trans isomerization of the peptide bonds preceding proline residues. These enzymes have been implicated in protein folding processes which depend on catalytic/chaperone-like activities. As cyclophilins, Human hCyP-A, human cyclophilin-B (hCyP-19), S. cerevisiae Cpr1 and C. elegans Cyp-3, are inhibited by the immunosuppressive drug cyclosporin A (CsA). CsA binds to the PPIase active site. Cyp-3. S. cerevisiae Cpr1 interacts with the Rpd3 - Sin3 complex and in addition is a component of the Set3 complex. S. cerevisiae Cpr1 has also been shown to have a role in Zpr1p nuclear transport. Human cyclophilin H associates with the [U4/U6.U5] tri-snRNP particles of the spliceosome.
1M63: single domain according to cath and scop.
DOMpro: single domain
-------------------------------------------------------------------------
		comparing to SP3, ROBETTA, RAPTOR.
		our second domain is not well predicted. second domain looks like a beta-barrel.
		we find the correct template, 2BCO, same as sp3, but apparently we didn't get the second domain
		well aligned using psi-blast. apparently, FOLDpro (human) is better than 3Dpro.
		we should replace both, in terms of domain orientation. but since it is evaluated
		by domain, this should be ok. 
		sp3 <-> 3dpro: rmsd: 2.1, z=6.5, seq ind: 65%, aligned/gap = 183/32. (only first domain is aligned)
		sp3 <-> foldpro: rmsd = 2.3, z=6.8, ind = 48.5, aligned/gap = 266. (align both domains)
		sp3 <-> Karypis: rmsd=3.3, z = 4.9, ind = 61.3, aligned/gap = 150/51.
		sp3 <-> raptor: rmsd = 2.0, z=7.0, ind = 77, aligned = 280.
		sp3 <-> bayeshh: rmsd = 2.6, z=7.0, ind=71, aligned/gap = 279. 
		sp3 <-> hhsearch1: rmsd=3.0, z=6.5, ind=84, aligned = 280. 
		THIS TARGET INDICATES THAT WE HAVE CHALLENGES GETTING THE ALIGNMENT OF A TWO-DOMAIN PROTEIN RIGHT.
		WE MAY ALSO NEED TO BUILD A COMPLEMENTARY LIBRARY BASED ON SCOP SINGLE-DOMAIN PROTEINS, WHICH
		CAN MAKE THE ALIGNMENT EASIER. THIS IS ALSO A PROBLEM OF TEMPLATE COMBINATION, OR 
		IF WE USE ONLY THE ONE RIGHT TEMPLATE BUT GET THE FULL ALIGNMENT, THIS MAY BE EASIER.
		it also means PSI-BLAST can't generate very long alignments. 


		ANOTHER TRICK IS:
		USE THE TEMPLATES FOUND BY PSI-BLAST, BUT GENERATE ALIGNMENTS USING LOBSTER, THEN GENERATE
		STX FOR TWO-DOMAIN PROTEINS IF WE BELIEVE BOTH DOMAINS SHOULD BE USED BUT PSI-BLAST
		ONLY GENERATES A LOCAL ALIGNMENT FOR ONE DOMAIN.
		
T0289
decision:
	intervene in FOLDpro, leave 3Dpro as it is.
	only take the templates in frcom.pir that appear in cm.pir,
	regenerate stx and put the model as the first model.
	using a script ~/modeller.sh pir_file output_dir 
	then convert pir to pdb:
	prosys/pdb2casp.pl pdb_file pir_file model_index output_file

	then compare the new pdb with casp1 to casp5 to make the final 
	decision.
	to do.
	
	currently for foldpro:
	289_3 (frcom) is good.
	289_1: (not good, the domain 1 is too small)
	289_2 (cm: only one domain)
	
	strategy:
		1. generate a human from frcom, use it as model 1.
		2. move current model 1 to model 2
	T0280.human is created.

	compare it to model 1 of 3dpro: DALI: z-score: 19.4, aligned res: 207, rmsd: 21 seq ind=53%.
	CE: rmsd: 2.1, z-score: 6.6, sequence ind: 62.8%, aligned = 191.

	so it is similar enough. 

	decide: submit the new human as model 1. 

        ****************************************************************
	domain decision: 1-210, 211-313.
	resubmit as well
	****************************************************************
  

3Dpro: frcom/cmfr is a two domain prediction.

another approach is to regroup templates in from,
remove inconsistent templates and regenerate stx.

group I:
Looks like 1YW4 is a very good template, but the domain definition is
very ambiguous for this case. According to PDB, this protein is a 
two-domain protein.
1UWY is also two domains. (1-296, 297-403)
1H8L two domains. (domain classification is same as 1YW4 in GO)
2G9D: one domain (but looks like two domains as 1YW4).
2BCO:  Succinylglutamate desuccinylase (2BCO:A, B)  	
    * hydrolase activity, acting on ester bonds 
    * metabolism 
1Yw6.
template 1UWY: two domains

group II:
template 1O5W: two domains
template 1QYD: fold is different from 1UWY.
the folds of 1UWY and 1O5W are different. (next combination should check scop
or stx clustering or construct a phylogeny tree)

3Dpro and FOLDpro find a lot of significant templates.
Now the key issue is alignment.
Lobster generates a profile-profile full-length alignment that covers all
regions of T0289 for 1UWY, 1O5W, 1H8L, 1YW4, 1QYD, 2G9D. 
Let's see how cm and fr are combined. If a lot of small fragments are added,
it won't help. A more clever combination is to select the biggest fragment
from a number of templates. So the order of combination should consider both ranking
and the fragment contribution. (TO DO FUTURE).
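
A minimal sketch (Python) of the "select the biggest fragment" idea above. This is not what the
current cm/fr combination does. The function name, the (rank, template, start, end) tuples and the
min_gain cutoff are all hypothetical, and the template names and ranges in the example are made up.

```python
def select_templates(alignments, min_gain=20):
    """Greedy pick: repeatedly take the template whose aligned region adds the
    most not-yet-covered target residues; ties go to the better (lower) rank.

    alignments: list of (rank, template, start, end), target residue range inclusive.
    min_gain:   hypothetical cutoff; stop when no template adds this many new residues.
    """
    covered = set()
    chosen = []
    remaining = list(alignments)

    def gain(a):
        # fragment contribution = number of target residues not covered so far
        return len(set(range(a[2], a[3] + 1)) - covered)

    while remaining:
        best = max(remaining, key=lambda a: (gain(a), -a[0]))
        if gain(best) < min_gain:
            break
        chosen.append(best[1])
        covered |= set(range(best[2], best[3] + 1))
        remaining.remove(best)
    return chosen, covered


# made-up example: four templates aligned to a 313-residue target
example = [(1, "tmplA", 1, 102), (2, "tmplB", 1, 171),
           (3, "tmplC", 160, 313), (4, "tmplD", 95, 110)]
chosen, covered = select_templates(example)
print(chosen, len(covered))
```

In this toy case the small fragments (tmplA, tmplD) add nothing once the two long alignments are
taken, so they are left out of the combination.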
---------------------------------------------------------------------------------------------
T0289: (easy)
Succinylglutamate desuccinylase / Aspartoacylase family (single domain????)
This is an interesting target.
3Dpro: profile includes 43 sequences
FOLDpro: profile includes 113 sequences.
But psi-blast in 3Dpro generates longer local alignments covering more regions.
Both find the same templates, but the 3Dpro alignment covers the first 171 residues.
FOLDpro only covers the first 102 residues.
That means a larger NR database does not necessarily yield longer, more significant
alignments. Let's wait to see what FR finds and how they are combined together.

Templates:
2G9D: Succinylglutamate desuccinylase. function class: hydrolase.
1YW4: Succinylglutamate desuccinylase. function class: hydrolase.
1YW6: same as above.
2BCO: same as above.
All from the same family and are consistent.

2G9D has a lot of disordered regions.

-------------------------------------------------------------------------
T0288 is an easy target.
a lot of matches from psi-blast.
--------------------------------------------------------------------------
TRICKS:
MANUALLY EVALUATE MODELS FOR HARD TARGETS USING
SA, SS, VERIFY3D, SVM RANK SCORE, and VISUAL
INSPECTION.
------------------------------------------------------------------------
T0287: hard target
Majority of secondary structure elements are helices.
It has three strands which are predicted to form an anti-parallel 
beta-sheet. For this kind of hard target, we can at least get this
beta-sheet right, which is worth a lot of GDT-TS score. Then we try to
put a few helices in the right position.
---------------------------------------------------------------------------

T0286 is an easy target, identified by CDD. 

homologous to cd00229.3 (representatives: 1ESC, 1ESE)

SGNH_hydrolase, or GDSL_hydrolase, is a diverse family of lipases and esterases. 
The tertiary fold of the enzyme is substantially different from that of the alpha/beta 
hydrolase family and unique among all known hydrolases; its active site closely 
resembles the typical Ser-His-Asp(Glu) triad from other serine hydrolases, but may lack the carboxylic acid.

from CATH and SCOP (about 1ESC)
SCOP Classification (version 1.69)   	
Domain Info 	Class 	Fold 	Superfamily 	Family 	Domain 	Species
d1esc__ 	Alpha and beta proteins (a/b) 	Flavodoxin-like 	SGNH hydrolase 	Esterase 	Esterase 	Streptomyces scabies
CATH Classification (version v2.6.0) 	
Domain 	Class 	Architecture 	Topology 	Homology
1esc00 	Alpha Beta 	3-Layer(aba) Sandwich 	Rossmann fold 	HYDROLASE

both FOLDpro/3Dpro found related proteins in Flavodoxin fold.

compare models:
3dpro: model 1 and 3? aligned: 176, rmsd: 2.1, seq ind: 81%, z = 22. 
foldpro: aligned residues: 181, rmsd: 2.9, seq ind: 77%, z = 18. 
So those models are very similar. 

---------------------------------------------------------------------------
		DECIDE NOT TO INTERVENE ANYMORE UNLESS FIND SOME
		VERY OBVIOUS PROBLEM.

T0285: (hard target)
BETApro and SVMcon find a lot of common contacts. 

285 is a hard target.
According to secondary stx, the two ends are two long helices. in the middle are
short beta-strands (a little helix).

3Dpro predicts an alpha+beta protein (cool stx).

FOLDpro: the model is pretty loose. Model 2 looks better. Let's try to use
verify 3D to evaluate models.

3Dpro: verify3d ranking is consistent with svm ranking. model 1: 25. 

FOLDpro: model 1: 12, model 2: 15, model 3: 16, model 4: 16, model 5: 17. 
The svm score of model 1 and model 2 is very close. And Model 2 is also ranked
as third by 3Dpro. So decide to exchange model 1 and model 2 of foldpro. 
Exchange model 1 and 2. and also model 2 seems to fit secondary stx better. DONE!

compare model 1 of foldpro with 3dpro: there is a little similarity. compare
model 2 (used as new model 1) of foldpro with model 1 of 3dpro: there is no similarity
at all. that means, visual similarity is not reliable at all. 

FINAL DECISION: use the original model 1 and model 2. don't exchange. I guess it 
is a new fold. 

Human prediction of topology: 
BETApro prediction of four strands:
3--4:A:[89-92:101-98]:1.78
2--3:P:[62-65:89-92]:0.82
1--4:A:[39-45:104-98]:0.78
Key elements of the protein: 
H1 E1  E2 E3 E4 H2
I guess the topology is:
Strand pairings as above. H1 and H2 are also in parallel. So it is two layer protein. 
One layer is four strands, another layer is two helices.
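
A small sketch (Python) for reading the BETApro pairing lines above and chaining the strands into a
sheet order. Assumptions, not confirmed by the notes: "A"/"P" mean anti-parallel/parallel, the last
field is a pairing score, and each strand has at most two partners (an open sheet). The names
parse_pairing and sheet_order are hypothetical.

```python
import re
from typing import NamedTuple


class Pairing(NamedTuple):
    strand_a: int
    strand_b: int
    orientation: str  # "A" or "P" (assumed anti-parallel / parallel)
    range_a: tuple    # (first, last) residue as written in the line
    range_b: tuple
    score: float      # assumed to be a pairing score


_PAIR_RE = re.compile(
    r"^(\d+)--(\d+):([AP]):\[(\d+)-(\d+):(\d+)-(\d+)\]:([0-9.]+)$")


def parse_pairing(line):
    """Parse one line such as '3--4:A:[89-92:101-98]:1.78'."""
    m = _PAIR_RE.match(line.strip())
    if m is None:
        raise ValueError("unrecognised pairing line: %r" % line)
    a, b, orient, s1, e1, s2, e2, score = m.groups()
    return Pairing(int(a), int(b), orient,
                   (int(s1), int(e1)), (int(s2), int(e2)), float(score))


def sheet_order(pairings):
    """Walk from one edge strand to the other through the pairings."""
    neighbours = {}
    for p in pairings:
        neighbours.setdefault(p.strand_a, []).append(p.strand_b)
        neighbours.setdefault(p.strand_b, []).append(p.strand_a)
    edges = sorted(s for s, n in neighbours.items() if len(n) == 1)
    if not edges:
        raise ValueError("no edge strand (empty input or closed barrel)")
    order, prev = [edges[0]], None
    while len(order) < len(neighbours):
        nxt = [s for s in neighbours[order[-1]] if s != prev]
        if not nxt:
            break
        prev = order[-1]
        order.append(nxt[0])
    return order


lines = ["3--4:A:[89-92:101-98]:1.78",
         "2--3:P:[62-65:89-92]:0.82",
         "1--4:A:[39-45:104-98]:0.78"]
print(sheet_order([parse_pairing(l) for l in lines]))  # strand order across the sheet
```

For the three pairings listed, the walk gives the spatial order E1, E4, E3, E2 across the sheet.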

Formulate an ab-initio protein stx prediction algorithm:
Step 1: generate building blocks/fragments according to secondary stx and associate 
a flexibility score with each block (bending flexibility, switching flexibility
that the block changes to another SS element, buried/exposed score). generate local
stx for these blocks. 

also predict the turning elements.

Step 2: Align beta building blocks. (may select a number of patterns according to BETApro)
This will generate a number of different trajectories.

Step 3: align helical elements according to beta-sheet and turns.

Step 4: MCMC refinement. adjust positions of elements or Ca atoms and select by 
energy function.

Step 5: cluster stx using stx alignments. 
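
Step 4 could use a standard Metropolis acceptance rule. A minimal sketch (Python), assuming a generic
energy function and move proposal; both are hypothetical placeholders, since the notes do not fix the
energy function or the move set. The toy 1-D example only shows the control flow.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng=random):
    """Accept downhill moves always, uphill moves with prob exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def refine(state, energy, propose, n_steps=1000, temperature=1.0, seed=0):
    """Generic MCMC refinement loop; returns the lowest-energy state seen.

    state:   any representation (e.g. positions of elements or Ca atoms)
    energy:  callable(state) -> float          (placeholder)
    propose: callable(state, rng) -> new state (placeholder)
    """
    rng = random.Random(seed)
    current, e_cur = state, energy(state)
    best, e_best = current, e_cur
    for _ in range(n_steps):
        candidate = propose(current, rng)
        e_new = energy(candidate)
        if metropolis_accept(e_new - e_cur, temperature, rng):
            current, e_cur = candidate, e_new
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best


# toy example: minimise (x - 3)^2 starting from x = 0
best, e_best = refine(0.0,
                      energy=lambda x: (x - 3.0) ** 2,
                      propose=lambda x, rng: x + rng.uniform(-0.5, 0.5))
print(best, e_best)
```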
-----------------------------------------------------------------------------
T0284 easy
---------------------------------------------------------------------------
T0283 (hard)

3D: (models are ok) (four helix bundle)
FOLDpro and 3Dpro: 1P68A (same as sparks3)
FOLDpro also identifies 1NI7, same as (raptor, forte, and meta-tasser)

visual:
	sparks3 predict four helix bundle.
	robetta-ab: three-helix bundle.

align against Robetta using Dali:
foldpro vs robetta: Z-score = 1.4, RMSD = 8.2, %= 44.
sparks vs robetta: z=2.0, RMSD: 3.5, %=8.
foldpro vs sparks: z=3.5, rmsd=2.8, %=5.
abipro vs robetta: z-score = 2.6, aligned: 68, rmsd: 7.3, %= 6.

domain:
All except ginzu: 1 domain

Contact:

betapro: ok
distill: all short range contacts.
GPCPRED: >=8 separation
Pssum (Hamilton): long range
PROFcon: >= 6
SAM: long range
SVMcon: ok
------------------------------------------------------------------------------
T0284: (easy)
FOLDpro: model 3 (frcom), 3Dpro: model 2 (frcom):
Ca-Ca clashes (about 15 pairs) due to stx inconsistency among a lot of templates.
Solution:
a) we should have a Ca-Ca clash detection script that takes a model (Ca-Ca < 4.? Angstrom)
b) Ca-Ca clashes can be used to discard some models or to discard some templates
in model generation
c) we should use stx alignment to check consistency among top-ranked templates. choose
the largest consistent cluster of templates to generate stx in future.

Next time:
	a) visually inspect; if there are clashes, we may discard the model and replace it with
		Model 6.
	b) or detect Ca-Ca clashes, and regenerate the model by selecting half of the templates?
		or the top five templates?
	To do: 
		write a script to check clashes.
		send an alert email to me if clashes happen. 
	Clashes may give some advantage, but will be seriously penalized. 

From this paper: CASP6 data processing and automatic evaluation at the protein structure prediction center
Andriy Kryshtafovych, Maciej Milostan, Lukasz Szajkowski, Pawel Daniluk, Krzysztof Fidelis *

definition of geometric irregularities:
	dist: Ca-Ca distance
	irregularity: 0.1 < dist < 3.6 or dist > 4.0
	Severe collision: 0.1 < dist < 1.9
	same position: dist < 0.1
	They also check model similarity between predictions and identify similar or
	identical models.
done.
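
A minimal sketch (Python) of a Ca-Ca clash check using the distance cutoffs quoted above
(same position < 0.1, severe < 1.9, bump < 3.6). It is not the script referred to by "done".
Assumptions of mine: Ca atoms are read from fixed PDB columns, sequence neighbours are skipped
(min_sep=2, by position in the file, single chain), boundary values are assigned to the upper
class, and the "dist > 4.0" irregularity is left out.

```python
import math


def read_ca(pdb_lines):
    """Return [(residue number, (x, y, z))] for CA atoms (fixed PDB columns)."""
    cas = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            cas.append((int(line[22:26]), xyz))
    return cas


def classify(dist):
    """Label one Ca-Ca distance (in Angstrom) with the cutoffs from the notes."""
    if dist < 0.1:
        return "same_position"
    if dist < 1.9:
        return "severe"
    if dist < 3.6:
        return "bump"
    return "ok"


def clash_report(cas, min_sep=2):
    """Count clashes over all Ca pairs at least min_sep positions apart."""
    counts = {"same_position": 0, "severe": 0, "bump": 0}
    for i in range(len(cas)):
        for j in range(i + min_sep, len(cas)):
            label = classify(math.dist(cas[i][1], cas[j][1]))
            if label != "ok":
                counts[label] += 1
    return counts
```

Usage would be something like clash_report(read_ca(open(model_file))), where model_file is whatever
model is being checked.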

Another reason that causes clashes is that the template is not good. 
	T0287, FOLDpro, model 1.

Mistake:
	for models with more than 5 (>5) clashes, or 1 severe clash, the model 
	should be discarded. Use model 6 or 7 to replace it. 
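
The replacement rule above can be sketched as follows (Python). The function name and the
(name, n_clashes, n_severe) tuples are hypothetical; the clash counts are assumed to come from some
separate clash check, and the example numbers are made up.

```python
def pick_models(ranked, max_clashes=5, max_severe=0, n_submit=5):
    """Walk models in rank order, drop those over the clash limits, and let
    lower-ranked models (6, 7, ...) move up to fill the submitted slots.

    ranked: list of (name, n_clashes, n_severe) in rank order.
    """
    kept, dropped = [], []
    for name, n_clashes, n_severe in ranked:
        if n_clashes > max_clashes or n_severe > max_severe:
            dropped.append(name)
        elif len(kept) < n_submit:
            kept.append(name)
    return kept, dropped


# made-up example: model1 has too many clashes, model3 has a severe clash
ranked = [("model1", 15, 0), ("model2", 0, 0), ("model3", 2, 1),
          ("model4", 1, 0), ("model5", 0, 0), ("model6", 3, 0),
          ("model7", 0, 0)]
kept, dropped = pick_models(ranked)
print(kept)     # models to submit, in order
print(dropped)  # models to report in the clash report
```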

Generate clash report. 

CASP6 penalization policy for clashes:
Those models with greater than 50 bumps (where the Ca-Ca distances were between 1.9 Å and 3.6 Å) 
or that had more than 4 severe clashes (Ca-Ca distances of less than 1.9 Å) were penalized.
The choice of cut-offs was rather arbitrary, but also fairly generous. We checked a selection of 1000 chains from the PDB and found just one chain with more than 16 minor clashes. Penalized models were inspected manually and those that contained visible backbone-backbone clashes or that were otherwise clearly unfeasible [Fig. 1(a,b)] had both their AL0 and GDT-TS z-scores set to 0. In total, 55 first models were penalized in this way. 
Reference: 
Assessment of predictions submitted for the CASP6 comparative modeling category
Michael Tress *, Iakes Ezkurdia, Osvaldo Graña, Gonzalo López, Alfonso Valencia

Comments: 
Many servers select 1MUM. FOLDpro selects a lot of templates, but 1MUM is not in the top five shown
in the model file (ranked no. 6). But it is definitely used in modeling. 1MUM resolution
is 1.9 A, 1S2V is 2.1 A. They are in the same family. In the first round of pdb-blast, 1MUM
is ranked #2; in the second round it was ranked #6. I think most other people probably only
use blast to search PDB to get 1MUM. 

Stx-stx alignment between SP3 for T0284 using CE:
Rmsd = 1.5Å Z-Score = 7.5
Sequence identity = 97.0%
Aligned/gap positions = 265/14

between sp3 and Karypis: (both are using the same template)
Rmsd = 1.5Å Z-Score = 7.7
Sequence identity = 98.9%
Aligned/gap positions = 273/5

sparks 3 and jigsaw (using Dali), jigsaw also use multiple templates.
z=34, aligned residues: 261, rmsd: 1.7, seq identity: 95.

sparks 3 and foldpro (using dali)
z=37.5, aligned residues: 275, RMSD: 2.7, seq identity: 96. 

stxs are very similar. 

Domain: FOLDpro, 1 domain, same as most others (such as meta-dp)

VERY IMPORTANT:
In the BETApro prediction for this target, the strands are almost completely correctly
predicted except for the pairing of the first and the last strands.
This means that our BETApro can be used to reconstruct this kind of protein structure.


------------------------------------------------------------------------
T0287: hard target
Fugue finds 1V0D, same as 3Dpro (rank #1).
There is no other consensus.

Domain: 1 domain same as meta-dp. 

RR prediction:
	SVMcon usually has a few contacts in common with SAM-T06.

Foldpro <-> Robetta: no similarity
Foldpro <-> sp3: z=1.0, aligned res:65, rmsd: 3.7, seq ind: 3%.
sp3 <-> Robetta: no similarity
3dpro <->robetta: z=0.2, aligned res: 63, rmsd: 9.2, seq id: 5. 
abipro <-> robetta: z=0.1, aligned res: 57, rmsd: 10.3,seq ind=4%.

all stxs are very different. 
-----------------------------------------------------------------------------------
T0283:  hard target
predict as four helix bundle

------------------------------------------------------------------------------------

#####################################################################################
Very important adjustment:
1. show up to 8 parents
	change pdb2casp.pl file.

2. if no chain id, only need to show the four-letter pdb code. don't need to add "_".
	FOUR LETTER PDB CODE IS GOOD ENOUGH.

######################################################################################
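
A sketch (Python) of the two adjustments above, for reference only; the real change belongs in
pdb2casp.pl. Assumptions of mine: template ids are a four-letter PDB code optionally followed by
chain characters (like 1P68A), and the output is a CASP-style "PARENT" line with "N/A" when there is
no template. The function name is hypothetical.

```python
def format_parent_line(templates, max_parents=8):
    """Build a PARENT line: at most max_parents entries, 'code_chain' when a
    chain id is present, and just the four-letter code when it is not."""
    parents = []
    for t in templates[:max_parents]:
        code, chain = t[:4], t[4:]
        parents.append(code + "_" + chain if chain else code)
    if not parents:
        return "PARENT N/A"
    return "PARENT " + " ".join(parents)


print(format_parent_line(["1P68A", "1MUM"]))
```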