[Bioperl-l] Re: Bioperl Seq objects and strands/GenBank parse

Hilmar Lapp hlapp@gmx.net
Mon, 25 Sep 2000 08:34:39 +0200


Mark Wilkinson wrote:
> 
> http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=9755607&dopt=GenBank
> 
> This is the example we are using to test our DrawableBioSeq.pm module.
> 
> (1) those already mentioned below relating to a given top-level feature on the -1
> strand eventually being assigned to all 3 strands: -1 for its introns/exons, 0
> for its CDS_span and gene_span tags, and +1 with its gene tag.
> 
> (2) introns and exons are considered top level features, rather than sub-features
> of a gene.   I know (from bitter experience!) that writing GenBank parsers is a
> nightmare

Yes indeed. After looking at the entry you provided I think the most
important things have already been said by Ewan. A few clarifications
though, to give you a better idea of how the GenBank parser works.

1) Every feature that has a key is treated as a top-level feature. The
key itself is not interpreted at all. Hence, upon encountering a feature
with key "exon", no attempt is made to recognize that it could actually
be a sub-feature of a preceding (sometimes also a following!) "gene"
feature.

2) If the location of a feature is recognized as being a compound
location, a top-level feature is created, the location is split into its
parts, and every part is added as a sub-feature. Obviously, these
sub-features inherit the key (primary_tag) from the top-level compound
feature. In addition, "_span" is appended to the key. No interpretation
whatsoever is done about what the individual parts could represent.

3) The strand is determined by matching the location against
/complement/. If it matches, the strand is set to -1, otherwise to 1.
In the entry you provided the same gene on the reverse strand (but not
only those) is annotated twice: once without complement in the
location, and then again with complement and the full compound
location. To catch this, one would have to interpret the qualifier tag
"gene", because its value is identical in both annotations, and I don't
see another way to merge the two (which would still require that you
trust one location more than the other).

In other words, the entry you provided is a complete mess.
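
To make rules 2) and 3) concrete, here is a minimal sketch in Python. It
is not the Bioperl code, only an illustration of the behaviour described
above. The names Feature, strand_of, split_compound and build_feature
are hypothetical. The sketch assumes that the parts of a compound
location contain no nested parentheses, and that each part simply takes
the strand computed for the whole location; neither assumption is
stated above.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feature:
    # Hypothetical stand-in for a sequence feature object.
    primary_tag: str
    location: str
    strand: int
    sub_features: List["Feature"] = field(default_factory=list)


def strand_of(location: str) -> int:
    # Rule 3: -1 if the location text matches /complement/, else 1.
    return -1 if re.search(r"complement", location) else 1


def split_compound(location: str) -> List[str]:
    # Rule 2: return the parts of a join(...) or order(...) location,
    # or an empty list for a simple location.
    # Simplification: parts must not contain parentheses themselves.
    match = re.search(r"(?:join|order)\(([^()]*)\)", location)
    if match is None:
        return []
    return [part.strip() for part in match.group(1).split(",")]


def build_feature(key: str, location: str) -> Feature:
    # Rule 1: the key is copied as-is and never interpreted.
    strand = strand_of(location)
    top = Feature(key, location, strand)
    for part in split_compound(location):
        # Each part becomes a sub-feature keyed "<key>_span".
        # Assumption: the part takes the strand of the whole location.
        top.sub_features.append(Feature(key + "_span", part, strand))
    return top
```

With this sketch, build_feature("gene", "complement(join(100..200,300..400))")
gives a top-level "gene" feature on strand -1 with two "gene_span"
sub-features, while a second annotation of the same gene written
without complement would come out on strand 1. Nothing in the sketch
connects the two, which is the situation described in 3).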

> 
> (3)  when you call the sub-SeqFeatures of a top-level feature the sub-features
> all have the primary tag "gene"...??? what the, hey!?!??  aren't these the
> introns and exons??
> 

See above.

> 
> P.S.  I just read Arlin's post to the group - it appears that this problem may
> well be intractable (and lies not in the parser, but rather in the source)
> 

I don't think it is intractable. In a sense, if you need data, you have
to take them where they are. However, parsing by a computer program
starts to be a *real* problem as soon as you need the *semantics* and not
only the syntax. Whenever I can avoid it, I don't want to be the
programmer who has to interpret the semantics of an evolving language.

	Hilmar

-- 
-----------------------------------------------------------------
Hilmar Lapp                                email: hlapp@gmx.net
NFI Vienna, IFD/Bioinformatics             phone: +43 1 86634 631
A-1235 Vienna                                fax: +43 1 86634 727
-----------------------------------------------------------------