Bioperl: EMBL/GenBank parsing

hilmar.lapp@pharma.Novartis.com
Mon, 8 May 2000 14:59:19 +0100




I think we should have the following criteria for the format and
parsing

[...]

This suggests quite an overhaul of aspects to the parsing.

     Basically, I agree with all points, as well as with the conclusion.


I would like to suggest that we have the following set up:


- a common base class for EMBL/GenBank/Swissprot parsing

     We already have one, namely Bio::SeqIO. I am unsure what may justify an
     additional intermediate layer for only some of the parsers (I'm sure these
     three are not the only formats featuring feature tables). In addition, I
     think inheriting from SeqIO is adequate, because that's what the parsers
     are supposed to do: input from and output to a specific format.

- specific classes for each format *only* handle the parsing of the
  format in a producer/consumer type manner. The parsers essentially
  provide objects which are
     (tag1,tag2,@lines) with

          tag1 being ID,CC etc
          tag2 being empty for everything but Feature Table,
               where they are the key
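
A minimal sketch of what such (tag1, tag2, lines) chunks could look like
for an EMBL-style entry. It is written in Python purely as a
language-neutral illustration; the function name `chunks`, the sample
entry, and the rule that a blank key column continues the previous
feature are assumptions made for this example, not existing Bioperl code.

```python
def chunks(lines):
    """Yield (tag1, tag2, lines) for runs of EMBL-style lines.

    tag1 is the two-letter line code; tag2 is the feature key for FT
    lines and empty for everything else.
    """
    current = None  # [tag1, tag2, [content lines]]
    for raw in lines:
        line = raw.rstrip("\n")
        tag1, body = line[:2], line[5:]
        # Skip blank lines, XX spacer lines and the // terminator.
        if not tag1.strip() or tag1 in ("XX", "//"):
            continue
        key = ""
        if tag1 == "FT":
            if body[:1].isspace():
                # Blank key column: continuation of the current feature.
                body = body.strip()
            else:
                key, _, body = body.partition(" ")
                body = body.strip()
        continues = (
            current is not None
            and current[0] == tag1
            and (tag1 != "FT" or key == "")
        )
        if continues:
            current[2].append(body)
        else:
            if current is not None:
                yield tuple(current)
            current = [tag1, key, [body]]
    if current is not None:
        yield tuple(current)


# Hypothetical sample entry, abbreviated for illustration.
SAMPLE = [
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "XX",
    "CC   first comment line",
    "CC   second comment line",
    "FT   source          1..1859",
    'FT                   /organism="Trifolium repens"',
    "FT   CDS             14..1495",
    "//",
]

result = list(chunks(SAMPLE))
for tag1, tag2, content in result:
    print(tag1, repr(tag2), content)
```

The producer knows only the line layout; what an ID or CDS chunk means
is left entirely to whatever consumes the tuples.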

     This looks to me like having a metaformat. If this is what you mean, the
     parsers would still have to do almost the same thing as now, namely
     understand the syntax and semantics of the format completely.

     I'd think of something like factory objects, which require the basic parser
     to have only a very basic understanding of the information contained, like
     how records are separated, how attributes are separated (named attributes
     in this case, so how to obtain the name and how to obtain the value), etc.
     There could then be a general factory (changeable by the user) which is
     passed a name/value attribute pair and returns a factory that is capable
     of converting the name/value pair into an object.

     This way users can fully control how objects are built out of the
     contents of files, by providing their own factories that build objects
     their own way, or even build their own objects.

     But maybe this is what you meant anyway.
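
     A rough sketch of the factory idea, in Python purely as a
     language-neutral illustration. All names here (`GeneralFactory`,
     `KeepAsText`, `SplitKeywords`, `parse_records`) are hypothetical and
     invented for this example. The basic parser is assumed to know only
     that `//` ends a record and that the first two characters of a line
     are the attribute name.

```python
class KeepAsText:
    """Default factory: keep the attribute as a plain (name, value) pair."""

    def build(self, name, value):
        return (name, value)


class SplitKeywords:
    """Example of a user-supplied factory: turn 'a; b.' into a list."""

    def build(self, name, value):
        words = [w.strip() for w in value.rstrip(".").split(";")]
        return (name, [w for w in words if w])


class GeneralFactory:
    """Given a name/value pair, return the factory that should build it."""

    def __init__(self, by_name=None, default=None):
        self.by_name = dict(by_name or {})
        self.default = default or KeepAsText()

    def factory_for(self, name, value):
        return self.by_name.get(name, self.default)


def parse_records(lines, general):
    """Basic parser: knows record and attribute separators, nothing more."""
    record = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == "//":  # record separator
            if record:
                yield record
            record = []
            continue
        if not line.strip():
            continue
        name, value = line[:2], line[2:].strip()
        record.append(general.factory_for(name, value).build(name, value))
    if record:
        yield record


text = "ID   SEQ1\nKW   kinase; membrane.\n//\nID   SEQ2\n//\n"
general = GeneralFactory({"KW": SplitKeywords()})
recs = list(parse_records(text.splitlines(), general))
print(recs)
```

     Swapping in a different `GeneralFactory` changes which objects come
     out of the file without touching `parse_records`.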

- The common base class has hooks for parsers on hashes on tag1/tag2.
  (if no specific tag2, default to tag1)
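
One way such hooks could be looked up, sketched in Python as a
language-neutral illustration. The class name `HookTable` and the handler
signature are assumptions made for this example. Handlers are stored in a
hash keyed on (tag1, tag2), and a lookup falls back to the tag1-only
entry when no handler exists for the specific tag2.

```python
class HookTable:
    """Handlers keyed on (tag1, tag2), falling back to (tag1, '')."""

    def __init__(self):
        self._hooks = {}

    def register(self, tag1, handler, tag2=""):
        self._hooks[(tag1, tag2)] = handler

    def dispatch(self, tag1, tag2, lines):
        handler = self._hooks.get((tag1, tag2))
        if handler is None:
            # No hook for this specific tag2: default to the tag1 hook.
            handler = self._hooks.get((tag1, ""))
        if handler is None:
            return None  # nothing registered for this tag at all
        return handler(tag1, tag2, lines)


hooks = HookTable()
hooks.register("FT", lambda t1, t2, lines: ("generic feature", t2))
hooks.register("FT", lambda t1, t2, lines: ("coding region", lines[0]), tag2="CDS")

cds = hooks.dispatch("FT", "CDS", ["14..1495"])
other = hooks.dispatch("FT", "source", ["1..1859"])
missing = hooks.dispatch("CC", "", ["a comment"])
print(cds, other, missing)
```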

- Another object, being a ParserController which has attributes like

     ->throw_on_error
     ->warn_on_error
     ->skip_on_error

     Why an object on its own? It could just as well be a method in SeqIO.
     Even a class method wouldn't hurt, and it would make it possible to set
     the behaviour before another class uses SeqIO internally (like a
     database get_seq).

     In general, I don't at all want to argue against a well-designed object
     model, but I think that whenever possible the hierarchy should be kept
     rather simple, specifically for the users, as long as the flexibility
     penalty is low enough to be affordable.
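
     A small sketch of the class-method variant for the throw/warn/skip
     setting, in Python purely as a language-neutral illustration. The
     class `SeqReader`, its `set_on_error` method and the line check are
     hypothetical stand-ins and not the SeqIO interface.

```python
import warnings


class SeqReader:
    """Stand-in reader with a class-level error policy."""

    on_error = "throw"  # class-wide default: "throw", "warn" or "skip"

    @classmethod
    def set_on_error(cls, policy):
        if policy not in ("throw", "warn", "skip"):
            raise ValueError("unknown error policy: %r" % (policy,))
        cls.on_error = policy

    def __init__(self, lines):
        self.lines = list(lines)

    def parse_line(self, line):
        # Toy check: a line must start with a two-letter code.
        if len(line) < 2 or not line[:2].isalpha():
            raise ValueError("cannot parse line: %r" % (line,))
        return line[:2], line[5:]

    def entries(self):
        for line in self.lines:
            try:
                entry = self.parse_line(line)
            except ValueError as err:
                if self.on_error == "throw":
                    raise
                if self.on_error == "warn":
                    warnings.warn(str(err))
                continue  # "warn" and "skip" both move on to the next line
            yield entry


# Set the policy once; readers created later, even inside other code
# (such as a database lookup), pick it up.
SeqReader.set_on_error("skip")
skipped = list(SeqReader(["ID   A", "?? bad", "CC   note"]).entries())
print(skipped)
```

     Because the setting lives on the class, a caller can choose the
     policy without having access to the reader instance that some other
     class creates internally.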


This would suggest quite a rewrite of the parsers, but we would gain
in flexibility. I'd like to kick this proposal around for a while -
I am sure Keith and Hilmar will have requirements to be met for the
parsers and we want to make sure we don't make this too complicated

     My requirement is that it works :) , and that not everything is hardwired,
     so that I can plug in my own things should I so wish, without then having
     to rewrite basically everything.

     Meant as an initial contribution to the discussion, not an opinion thought
     over for hours.

     Cheers,

          Hilmar




