[EMBOSS] A question about CON entries
Guy Bottu
gbottu at vub.ac.be
Mon Feb 4 09:11:11 UTC 2008
Peter Rice wrote:
> When reading a CON entry we need a database to use to read the true
> sequence and features.
>
> If we are reading from a database we can add the information in the
> database definition.
>
> How do we define a default to resolve EMBL CON entries?
>
> Can we handle EMBL release and EMBL updates?
There are a number of practical issues:
- an entry with "join" information can come from a databank as well as from a file.
- EMBL and GenBank CON entries refer to segments in the same databank, but
RefSeq refers to GenBank.
- a sequence presented to EMBOSS can be of CON or ANN type but already contain
the re-assembled sequence (depending on where it comes from)
- each site has its own DB entries in emboss.default, so code that explicitly
says "search in embl" might not work
So, IMHO:
- We need code for two cases: embl format (for EMBL,...) and GenBank format
(for GenBank, RefSeq,...). The software must check whether the entry contains
CO lines (embl format) or CONTIG lines (GenBank format); looking for CON in
the ID line is not reliable.
- for databank sequences: the DB entry in emboss.default should have a
parameter that indicates in which databank to search for the segments. If a site
has RefSeq and EMBL but no GenBank, then RefSeq could still use sequence
information from EMBL. If there is no such parameter in the DB entry, EMBOSS
could, for embl or genbank format entries, either search by default in the same
databank or simply not attempt the assembly (which do you think is best?).
- for "personal" sequences from files: this is trickier. Maybe an associated or
advanced parameter could say that, if the input sequence is of "join" type,
EMBOSS must use a given databank or file to retrieve the segments, e.g.
-sjoin=xxx or -join=xxx. If xxx is a databank, the segments can be retrieved
using the standard method defined in emboss.default, and if xxx is a file, it
can be searched sequentially.
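
To make the proposal above more concrete, here is a minimal Python sketch. It is
not EMBOSS code and only illustrates the logic: detect a "join" entry from its
CO or CONTIG lines (not from the ID line), list the segments, and decide where
to look for them. All names are hypothetical: the functions join_lines,
segments and segment_source, the db_defs dictionary standing in for the DB
entries of emboss.default, and the "segmentdb" key standing in for the proposed
parameter. The regular expression assumes segments written as
accession.version:start..end.

```python
import re

# Assumed segment syntax: accession.version:start..end, e.g. AL358912.1:1..39187
SEGMENT_RE = re.compile(r"([A-Z][A-Z0-9_]*\.\d+):(\d+)\.\.(\d+)")


def join_lines(entry_text):
    """Return (format, join expression) based on CO / CONTIG lines only.

    format is "embl", "genbank" or None; the ID / LOCUS line is ignored.
    """
    fmt = None
    parts = []
    in_contig = False
    for line in entry_text.splitlines():
        if line.startswith("CO   "):
            fmt = "embl"
            parts.append(line[5:].strip())
        elif line.startswith("CONTIG"):
            fmt = "genbank"
            in_contig = True
            parts.append(line[6:].strip())
        elif in_contig and line.startswith(" "):
            # Continuation line of a GenBank CONTIG block
            parts.append(line.strip())
        else:
            in_contig = False
    return (fmt, "".join(parts)) if fmt else (None, "")


def segments(join_expression):
    """List (accession.version, start, end) for each segment; gaps are skipped."""
    return [(acc, int(start), int(end))
            for acc, start, end in SEGMENT_RE.findall(join_expression)]


def segment_source(db_name, db_defs, join_option=None, same_db_default=True):
    """Decide where to fetch segments from; None means "do not assemble".

    join_option    -- value of a hypothetical -sjoin / -join qualifier
    db_defs        -- hypothetical stand-in for the DB entries in emboss.default
    "segmentdb"    -- hypothetical name for the proposed DB parameter
    """
    if join_option:
        return join_option
    if db_name is None:
        return None  # sequence from a file and no qualifier given
    configured = db_defs.get(db_name, {}).get("segmentdb")
    if configured:
        return configured
    return db_name if same_db_default else None
```

With such a scheme, a site that has RefSeq and EMBL but no GenBank would set the
parameter on its RefSeq entry to point at EMBL, and the same_db_default switch
corresponds to the open question of whether to fall back to the same databank
or to skip the assembly.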
There are still some open issues:
- the program entret is for retrieving entries as they are rather than for
processing sequence information. Should entret also try the assembly or not?
- feature information is another matter. Some entries have no or very poor
feature information, but there are entries whose features differ from those
of the segment entries (this is certainly so for the ANN entries in EMBL and
for RefSeq). How should we handle this?
Guy Bottu,
BEN