[EMBOSS] A question about CON entries

Guy Bottu gbottu at vub.ac.be
Mon Feb 4 09:11:11 UTC 2008

Peter Rice wrote:
> When reading a CON entry we need a database to use to read the true 
> sequence and features.
> If we are reading from a database we can add the information in the 
> database definition.
> How do we define a default to resolve EMBL CON entries?
> Can we handle EMBL release and EMBL updates?

There are a number of practical issues:
- an entry with "join" information can come from a databank as well as from a file.
- EMBL and GenBank CON entries refer to segments in the same databank, but 
RefSeq entries refer to segments in GenBank.
- a sequence presented to EMBOSS can be of CON or ANN type and yet already carry 
a re-assembled sequence (depending on where it comes from).
- each site has its own DB entries in emboss.default, so code that explicitly 
says "search in embl" might not work.

So, IMHO :
- We need code for two cases: embl format (for EMBL, ...) and GenBank format 
(for GenBank, RefSeq, ...). The software must check whether the entry contains 
CO lines (embl format) or CONTIG lines (GenBank format); looking for CON in the 
ID line is not reliable, since an entry can be of CON or ANN type.
- for databank sequences: the DB entry in emboss.default should have a 
parameter that indicates in which databank to search for the segments. If a site 
has RefSeq and EMBL but no GenBank, then RefSeq could still use sequence 
information from EMBL. If the DB entry has no such parameter, EMBOSS could, for 
embl or genbank format entries, either search the same databank by default or 
simply not attempt the assembly (which do you think is best?).
- for "personal" sequences from files: this is trickier. Maybe an associated or 
advanced parameter could say that, if the input sequence is of "join" type, a 
given databank or file must be used to retrieve the segments, e.g. -sjoin=xxx or 
-join=xxx. If xxx is a databank, the segments can be retrieved using the 
standard method defined in emboss.default; if xxx is a file, it can be 
searched sequentially.
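
To make the proposal above concrete, here is a minimal sketch in Python (not 
EMBOSS's own C code, and not an existing EMBOSS API). It shows detecting "join" 
information from CO or CONTIG lines rather than from the ID line, listing the 
segments, and choosing where to look them up. All function names are 
hypothetical, and the order of precedence (command-line parameter, then DB 
entry parameter, then the entry's own databank) is only one possible answer to 
the question raised above, not a settled design.

```python
import re

# Hypothetical helper: collect the join(...) text of a flat-file entry.
# embl format keeps it on "CO" lines; GenBank format (GenBank, RefSeq)
# keeps it on a "CONTIG" line followed by indented continuation lines.
def join_expression(entry_text, fmt):
    if fmt not in ("embl", "genbank"):
        raise ValueError("unsupported format: " + fmt)
    parts = []
    in_contig = False
    for line in entry_text.splitlines():
        if fmt == "embl":
            if line.startswith("CO "):
                parts.append(line[2:].strip())
        elif line.startswith("CONTIG"):
            in_contig = True
            parts.append(line[len("CONTIG"):].strip())
        elif in_contig and line.startswith(" "):
            parts.append(line.strip())
        else:
            in_contig = False
    # No CO/CONTIG lines at all means there is nothing to assemble.
    return "".join(parts) or None

# Matches "ACCESSION.VERSION:start..end"; gap(...) elements have no colon
# and are skipped. Single-base segments are not handled in this sketch.
SEGMENT = re.compile(r"([A-Z][A-Z0-9_]*(?:\.\d+)?):(\d+)\.\.(\d+)")

def segments(join_text):
    return [(acc, int(start), int(end))
            for acc, start, end in SEGMENT.findall(join_text)]

# Assumed precedence: a -sjoin style parameter, then a parameter in the
# DB entry, then the databank the entry came from. None = do not assemble.
def segment_source(cmdline_join=None, db_join_param=None, own_db=None):
    for candidate in (cmdline_join, db_join_param, own_db):
        if candidate:
            return candidate
    return None

if __name__ == "__main__":
    entry = ("ID   EXAMPLE; SV 1; linear; genomic DNA; CON; HUM; 100 BP.\n"
             "CO   join(AL358912.1:1..39187,gap(unk100),\n"
             "CO   complement(AL137130.2:2001..5000))\n"
             "//\n")
    print(segments(join_expression(entry, "embl")))
    print(segment_source(db_join_param="embl", own_db="refseq"))
```

With this precedence, an entry read from a file and given no parameter yields 
None, which corresponds to the "do not try the assembly" option.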

There are still some open issues:
- the program entret is for retrieving entries as they are rather than for 
processing sequence information. Should entret also try the assembly or not?
- feature information is another matter. Some entries have no feature 
information, or only very poor feature information, but there are entries whose 
features differ from those of the segment entries (this is certainly so for the 
ANN entries in EMBL and for RefSeq). How should we handle this?

	Guy Bottu,
