repost - Re: [Bioperl-l] Huh? Bioperl Seq objects and strands
Mark Wilkinson
mwilkinson@gene.pbi.nrc.ca
Wed, 20 Sep 2000 09:43:29 -0600
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&list_uids=9755607&dopt=GenBank
This is the example we are using to test our DrawableBioSeq.pm module. There are
several peculiarities that I have noticed while using the SeqIO-parsed sequence
entry which make it *exceedingly* difficult to use the parsed feature data
'stat':
(1) as already mentioned below, a given top-level feature on the -1 strand ends
up assigned to all 3 strand values: -1 for its introns/exons, 0 for its
CDS_span and gene_span tags, and +1 for its gene tag.
(2) introns and exons are considered top level features, rather than sub-features
of a gene. I know (from bitter experience!) that writing GenBank parsers is a
nightmare because the order/presence of feature tags is neither consistent nor
reliable... What made my job easier (though admittedly I have never solved this
problem satisfactorily...) is that I was writing parsers for ACEdb, where the
flexible format of the .ace file itself takes care of some of the parsing
problems (you don't necessarily have to parse all features of a gene in any
particular order, or remember where you are in the sequence file, as you simply
print to the .ace file with object:tag:value triplets as you go along... meaning
that the parser can be a bit more "dumb")
(3) when you call the sub-SeqFeatures of a top-level feature the sub-features
all have the primary tag "gene"...??? what the, hey!?!?? aren't these the
introns and exons??
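To make the strand problem in (1) concrete, here is a minimal sketch in Python using plain records rather than real SeqFeature objects. The Feature record, the coordinates, and the rule "trust the strand that the exons/introns agree on" are all assumptions made for illustration; none of this is BioPerl API or SeqIO behaviour.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    """Hypothetical flat feature record (not a BioPerl class)."""
    primary_tag: str
    start: int
    end: int
    strand: int  # -1, 0 (unknown), or +1


def strand_counts(features: List[Feature]) -> dict:
    """Count how many features were assigned to each strand value."""
    return dict(Counter(f.strand for f in features))


def consensus_strand(features: List[Feature]) -> Optional[int]:
    """Return the strand the exons/introns agree on, or None if they disagree.

    Assumption: exon/intron strands are the trustworthy ones.
    """
    strands = {
        f.strand
        for f in features
        if f.primary_tag in ("exon", "intron") and f.strand != 0
    }
    return strands.pop() if len(strands) == 1 else None


def reconcile(features: List[Feature]) -> List[Feature]:
    """Copy the consensus strand onto every feature; leave them alone if none."""
    strand = consensus_strand(features)
    if strand is None:
        return list(features)
    return [f._replace(strand=strand) for f in features]


# Made-up coordinates reproducing the pattern described in (1).
features = [
    Feature("gene", 100, 900, 1),
    Feature("gene_span", 100, 900, 0),
    Feature("CDS_span", 150, 850, 0),
    Feature("exon", 100, 300, -1),
    Feature("intron", 301, 499, -1),
    Feature("exon", 500, 900, -1),
]

if __name__ == "__main__":
    print(strand_counts(features))
    print(strand_counts(reconcile(features)))
```

This only works when the features of one gene have already been collected together, which is exactly the part that is hard to get from the flat feature list.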
If I am completely misunderstanding how the SeqIO parse is intended to be used,
or if I am missing a crucial bit of information that I should already have been
aware of please tell me where to get off!!!
To explain more clearly why this is frustrating our current project: we are
attempting to build two graphical maps. One map contains full genes assigned to
the correct strand; the other shows the full complement of features "exploded"
into their different feature types (and assigned different colors etc. from this
information). In its current state it is difficult to get this information
"cleanly" out of a GenBank SeqIO parse. We could make our module sophisticated
enough to re-parse the SeqIO features (actually, that is already in progress)...
but it seems that it might make more sense to (at least partially) solve the
problem at the source and have the Seq object itself give a better representation
of the data it has parsed in... or?
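As a rough sketch of the kind of re-parse meant here (again in Python with plain records, not DrawableBioSeq.pm or any BioPerl call), the two maps could be derived from a flat feature list as below. The Feature record, the made-up coordinates, and the rule that a feature belongs to the first gene whose span contains it are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical flat feature record (not a BioPerl class)."""
    primary_tag: str
    start: int
    end: int
    strand: int


def group_by_gene(features: List[Feature]) -> Dict[Feature, List[Feature]]:
    """Map 1: attach each non-gene feature to the first gene containing it."""
    genes = [f for f in features if f.primary_tag == "gene"]
    grouped: Dict[Feature, List[Feature]] = {g: [] for g in genes}
    for f in features:
        if f.primary_tag == "gene":
            continue
        for g in genes:
            # Assumed rule: simple coordinate containment within the gene span.
            if g.start <= f.start and f.end <= g.end:
                grouped[g].append(f)
                break
    return grouped


def explode_by_type(features: List[Feature]) -> Dict[str, List[Feature]]:
    """Map 2: bucket every feature by its primary tag (one colour per bucket)."""
    by_type: Dict[str, List[Feature]] = defaultdict(list)
    for f in features:
        by_type[f.primary_tag].append(f)
    return dict(by_type)


# Made-up example: two genes on opposite strands.
features = [
    Feature("gene", 100, 900, -1),
    Feature("exon", 100, 300, -1),
    Feature("intron", 301, 499, -1),
    Feature("exon", 500, 900, -1),
    Feature("gene", 1500, 2000, 1),
    Feature("exon", 1500, 2000, 1),
]

if __name__ == "__main__":
    for gene, parts in group_by_gene(features).items():
        print(gene.start, gene.end, gene.strand, [p.primary_tag for p in parts])
    for tag, items in explode_by_type(features).items():
        print(tag, len(items))
```

Containment by coordinates breaks down for overlapping or nested genes, which is one reason it would be nicer to have the grouping done once, at the source.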
I would be *more than willing to help* if you think I could be of any use!
(though too many cooks....)
Cheers all BioPerlers!
Mark "tired of Dave Block's use of exceedingly long and unnecessary middle name
designations ;-)" Wilkinson
P.S. I just read Arlin's post to the group - it appears that this problem may
well be intractable (and lies not in the parser, but rather in the source)
__________________________________________________________________________________
Hilmar Lapp wrote:
> Could you submit the sequence causing the mentioned misannotation. I have
> to add that I will probably not have time to look at this in more detail
> before the weekend. Maybe someone else wants to dig into it.
>
> I should mention that the feature table is parsed on a line-by-line
> basis, the features ending up as top-level features of the sequence
> together with their tags. There is no back-log kept, so if the reverse
> strand is not indicated immediately, there may be features for which the
> strand is not switched accordingly. If this is the case, it's certainly a
> bug.
>
> -----------------------------------------------------------------
> Hilmar Lapp email: hlapp@gmx.net
> NFI Vienna, IFD/Bioinformatics phone: +43 1 86634 631
> A-1235 Vienna fax: +43 1 86634 727
> -----------------------------------------------------------------
--
---
Dr. Mark Wilkinson
Bioinformatics Group
National Research Council of Canada
Plant Biotechnology Institute
110 Gymnasium Place
Saskatoon, SK
Canada