Bioperl: EMBL/GenBank parsing
hilmar.lapp@pharma.Novartis.com
Mon, 8 May 2000 14:59:19 +0100
I think we should have the following criteria for the format and
parsing
[...]
This suggests quite an overhaul of aspects to the parsing.
Basically, I agree with all points, as well as with the conclusion.
I would like to suggest that we have the following set up:
- a common base class for EMBL/GenBank/Swissprot parsing
We already have one, namely Bio::SeqIO. I am unsure what would justify an
additional intermediate layer for only some of the parsers (I'm sure these
three are not the only formats featuring feature tables). In addition, I
think inheriting from SeqIO is adequate, because that's what the parsers
are supposed to do: input from and output to a specific format.
- specific classes for each format *only* handle the parsing of the
format in a producer/consumer type manner. The parsers essentially
provide objects which are
(tag1,tag2,@lines) with
tag1 being ID,CC etc
tag2 being empty for everything but Feature Table,
where they are the key
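A rough sketch of what such a producer could look like, written in Python purely for illustration (the function name produce_chunks and the sample record are made up here, and the column handling is a simplification of the real EMBL layout, not the Bioperl implementation). It only knows the two-letter line code, the feature key column, and the "//" record terminator, and yields (tag1, tag2, lines) tuples:

```python
SAMPLE = [
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "CC   first comment line",
    "CC   second comment line",
    "FT   source          1..1859",
    'FT                   /organism="Trifolium repens"',
    "FT   CDS             14..1495",
    "//",
]

def produce_chunks(lines):
    """Yield (tag1, tag2, chunk_lines) for one EMBL-style record.

    tag1 is the two-letter line code; tag2 is the feature key for FT
    lines and the empty string for everything else.
    """
    tag1, tag2, chunk = None, "", []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):          # end of record
            break
        code, body = line[:2], line[5:]
        new_tag2 = tag2
        starts_new = code != tag1          # a different line code opens a new chunk
        if code == "FT" and body[:1].strip():
            # Non-blank key column: this FT line opens a new feature.
            key, _, body = body.partition(" ")
            new_tag2, starts_new = key, True
        elif code != "FT":
            new_tag2 = ""
        if starts_new:
            if chunk:
                yield tag1, tag2, chunk
            tag1, tag2, chunk = code, new_tag2, []
        chunk.append(body.strip())
    if chunk:
        yield tag1, tag2, chunk

if __name__ == "__main__":
    for t1, t2, chunk_lines in produce_chunks(SAMPLE):
        print(t1, repr(t2), chunk_lines)
```

The consumer side then never sees raw lines, only tagged chunks, which is the point of the producer/consumer split.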
This looks to me like having a metaformat. If this is what you mean, the
parsers would still have to do almost the same thing, namely understand
completely the syntax and semantics of the format.
I'd think of something like factory objects, which require the basic parser
to have only a very basic understanding of the information contained, like
how records are separated, how attributes are separated (named attributes
in this case, so how to obtain the name, and how the value), etc. There
could then be a general factory (changeable by the user) which is passed a
name/value attribute pair, returning a factory that is capable of
converting the name/value pair into an object.
This way users can take full control of the way objects are built out of
the contents of files by providing their own factories building objects
their own way, and building even their own objects.
But maybe this is what you meant anyway.
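To illustrate the factory idea, here is a minimal sketch in Python (for illustration only; FactoryLookup, factory_for, parse_record and the sample attribute names are all hypothetical, nothing here is existing Bioperl API). The basic parser knows only how records end and how to split a line into name and value; everything else is delegated to a user-replaceable lookup that returns a factory for each name/value pair:

```python
def split_attribute(line):
    """All the basic parser knows: name in columns 1-2, value from column 6."""
    return line[:2], line[5:].strip()

class FactoryLookup:
    """General factory: given a name/value pair, return a factory for it."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def factory_for(self, name, value):
        # Fall back to a plain (name, value) tuple for unregistered names.
        return self._factories.get(name, lambda n, v: (n, v))

def parse_record(lines, lookup):
    """Turn one record into objects, without understanding its semantics."""
    objects = []
    for line in lines:
        if line.startswith("//"):          # record separator
            break
        name, value = split_attribute(line)
        build = lookup.factory_for(name, value)
        objects.append(build(name, value))
    return objects

if __name__ == "__main__":
    lookup = FactoryLookup()
    # A user plugs in their own way of building an object for ID lines.
    lookup.register("ID", lambda n, v: {"accession": v.split(";")[0]})
    print(parse_record(["ID   X56734; SV 1", "DE   some description", "//"], lookup))
```

Swapping in a different FactoryLookup (or registering different factories) changes which objects come out, without touching the parser.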
- The common base class has hooks for parsers on hashes on tag1/tag2.
(if no specific tag2, default to tag1)
- Another object, being a ParserController which has attributes like
->throw_on_error
->warn_on_error
->skip_on_error
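One possible reading of these two points, sketched in Python for illustration (find_hook, dispatch and the hook table layout are assumptions; only the ParserController name and its three error modes come from the proposal). Hooks live in a hash keyed on (tag1, tag2), with a fallback to the tag1-only entry, and the controller decides what happens when a chunk cannot be handled:

```python
import warnings

class ParserController:
    """Holds the error policy: 'throw', 'warn' or 'skip'."""

    def __init__(self, on_error="throw"):
        if on_error not in ("throw", "warn", "skip"):
            raise ValueError("unknown error policy: %s" % on_error)
        self.on_error = on_error

    def handle(self, error):
        if self.on_error == "throw":
            raise error
        if self.on_error == "warn":
            warnings.warn(str(error))
        # 'skip': ignore the problem silently

def find_hook(hooks, tag1, tag2):
    """Look up (tag1, tag2); if there is no specific tag2 hook, default to tag1."""
    return hooks.get((tag1, tag2), hooks.get((tag1, "")))

def dispatch(chunks, hooks, controller):
    """Feed (tag1, tag2, lines) chunks to their hooks and collect the results."""
    results = []
    for tag1, tag2, lines in chunks:
        hook = find_hook(hooks, tag1, tag2)
        try:
            if hook is None:
                raise KeyError("no hook for %s/%s" % (tag1, tag2))
            results.append(hook(tag2, lines))
        except Exception as err:
            controller.handle(err)
    return results

if __name__ == "__main__":
    hooks = {
        ("FT", ""): lambda key, lines: ("feature", key, lines),
        ("FT", "CDS"): lambda key, lines: ("cds", lines),
    }
    chunks = [("FT", "source", ["1..10"]), ("FT", "CDS", ["2..9"]), ("XX", "", [""])]
    print(dispatch(chunks, hooks, ParserController("skip")))
```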
Why an object on its own? It could just as well be a method in SeqIO. Even a
class method wouldn't hurt, and it would make it possible to set this before
another class uses SeqIO internally (like a database get_seq).
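A small sketch of the class-method variant, in Python for illustration only (this SeqIO class, its error_mode method and the Database class are hypothetical stand-ins, not the Bio::SeqIO interface). Because the policy is stored on the class, it can be set before some other code creates the parser internally:

```python
import warnings

class SeqIO:
    """Hypothetical stand-in for a parser base class with a class-wide error policy."""

    _error_mode = "throw"

    @classmethod
    def error_mode(cls, mode=None):
        """Get the policy, or set it when a mode is given."""
        if mode is not None:
            if mode not in ("throw", "warn", "skip"):
                raise ValueError("unknown error mode: %s" % mode)
            cls._error_mode = mode
        return cls._error_mode

    def parse(self, text):
        """Return the ID value, or apply the error policy to a malformed record."""
        if not text.startswith("ID"):
            mode = self.error_mode()
            if mode == "throw":
                raise ValueError("record does not start with an ID line")
            if mode == "warn":
                warnings.warn("record does not start with an ID line")
            return None
        return text[5:].strip()

class Database:
    """Uses a SeqIO instance internally; the caller never sees it."""

    def get_seq(self, text):
        return SeqIO().parse(text)

if __name__ == "__main__":
    SeqIO.error_mode("skip")               # set before the parser is even created
    print(Database().get_seq("garbage"))
    print(Database().get_seq("ID   X56734"))
```

The trade-off is that a class-wide setting affects every parser in the process, whereas a separate controller object could be scoped per parser.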
In general, I don't at all want to argue against a well-designed object
model, but I think whenever possible the hierarchy should be kept rather
simple, specifically for the users, as long as the flexibility penalty is
low enough to be afforded.
This would suggest quite a rewrite of the parsers, but we would gain
in flexibility. I'd like to kick this proposal around for a while -
I am sure Keith and Hilmar will have requirements to be met for the
parsers and we want to make sure we don't make this too complicated.
My requirement is that it works :), and that not everything is hardwired,
so that I can plug in my own things should I so wish, without then having
to rewrite basically everything.
Meant as an initial contribution to the discussion, not an opinion thought
over for hours.
Cheers,
Hilmar