[Bioperl-l] genpept/swiss

Kris Boulez krbou@pgsgent.be
Tue, 5 Sep 2000 08:43:15 +0200

Quoting hilmar.lapp@pharma.Novartis.com (hilmar.lapp@pharma.Novartis.com):

> This describes exactly my situation in which I have to read in data in
> all sorts of different formats (and people's interpretations of these
> formats).
> The problem now is that BioPerl throws a warning if a sequence does not
> comply 100% with the standards and exits. While at that moment I want to
>      You mean it throws an exception. (Issuing a warning shouldn't cause an
>      exit.)
Yup, you're correct (I'm not so familiar with all this newspeak :) ).
When throwing an exception in the SeqIO system it might be handy to also
provide the part of the 'record' that has already been parsed and the
offending line.
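
As a rough illustration of the idea above (an exception that carries the part of the record parsed so far plus the offending line), here is a minimal sketch. It is written in Python rather than Perl purely for illustration and is not BioPerl code: the names RecordParseError, parse_record and read_leniently, and the toy 'TAG value' record layout, are all assumptions made up for this example.

```python
class RecordParseError(Exception):
    """Parse failure that keeps what was read so far (hypothetical name)."""

    def __init__(self, message, parsed_so_far, offending_line, line_number):
        super().__init__(f"{message} (line {line_number}: {offending_line!r})")
        self.parsed_so_far = parsed_so_far    # fields parsed before the failure
        self.offending_line = offending_line  # raw text that could not be parsed
        self.line_number = line_number


def parse_record(lines):
    """Parse a toy 'TAG value' record line by line; not a real EMBL parser."""
    known_tags = {"ID", "DE", "SQ"}
    record = {}
    for number, line in enumerate(lines, start=1):
        tag, _, value = line.partition(" ")
        if tag not in known_tags:
            # Hand the caller a copy of the partial parse and the bad line.
            raise RecordParseError("unknown tag", dict(record), line, number)
        record[tag] = value.strip()
    return record


def read_leniently(lines):
    """The caller decides whether the partial parse is good enough to keep."""
    try:
        return parse_record(lines)
    except RecordParseError as err:
        if "SQ" in err.parsed_so_far:  # the sequence itself was read correctly
            return err.parsed_so_far
        raise
```

With this shape the client, not the parser, chooses whether to ignore the problem, for example when the sequence has already been read. It does not address lines that come after the offending one, which the sketch simply abandons.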

> be able to say that he can ignore the warning if (e.g.) he has read the
> sequence correctly.
>      Does this sound like a call for a callback a client program can
>      provide? The question then is what should be passed to the callback
>      routine? The sequence object as it has been constructed so far? Sounds
>      fragile, and may be useless in many cases. The complete offending
>      source record? Would discard the parse done so far (for the callback),
>      and would require a partial rewrite of the parsers because they read
>      line-by-line (at least most if not all of the rich format parsers).
The line-by-line nature will indeed make it very hard if not impossible
to recover from, as some lines are already 'consumed' and the rest of
the 'record' isn't known at the moment of the offense (so you can't make
decisions based on what will follow).

I doubt, however, that we have to spend too much time on this problem, as I
think (like other people have suggested in this thread) XML should be
used for this type of problem.

> Something that would be really nice to have is a more modular approach
> in which it would be easy to say:  'this data is in a format which is
> EMBL, with the following quirks, additional fields, ... '.
>      Yes. But this needs a careful design of how can you split up the parse
>      of a sequence record into subtasks that are a) fairly independent (and
>      can thus be overridden by your QuirkyEMBL parser), and b) common to
>      all (rich) formats. Anyone's done any work in this direction so far?
A minimal subset might be what is needed for FastA (seq, id and ev.
What I find myself doing now is 'cp embl.pm quirkyembl.pm', hack at the
quirks and have '-format quirkyembl' for the SeqIO handle.
This of course isn't really what "code reuse" means, I guess :).
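
For contrast with the copy-and-hack approach, here is a minimal sketch of what "EMBL, with the following quirks" could look like if the parse were split into small overridable subtasks. It is again Python used only for illustration, not BioPerl: ToyEmblParser, QuirkyEmblParser, the handle_* method names and the two quirks shown (an extra 'XX' field and a trailing ';' on the ID) are all hypothetical, and the record layout is a toy one, not real EMBL.

```python
class ToyEmblParser:
    """Toy line-oriented parser; each tag is handled by its own method."""

    def parse(self, lines):
        record = {}
        for line in lines:
            tag, _, value = line.partition(" ")
            # Dispatch on the tag so a subclass can override a single subtask.
            handler = getattr(self, "handle_" + tag, self.handle_unknown)
            handler(record, tag, value.strip())
        return record

    def handle_ID(self, record, tag, value):
        record["id"] = value

    def handle_DE(self, record, tag, value):
        record["description"] = value

    def handle_SQ(self, record, tag, value):
        record["seq"] = record.get("seq", "") + value

    def handle_unknown(self, record, tag, value):
        raise ValueError(f"unexpected tag {tag!r}")


class QuirkyEmblParser(ToyEmblParser):
    """Only the quirks live here; everything else is inherited."""

    def handle_XX(self, record, tag, value):
        # Quirk 1: an additional field the strict parser does not know.
        record.setdefault("extra", []).append(value)

    def handle_ID(self, record, tag, value):
        # Quirk 2: tolerate a trailing ';' on the identifier.
        super().handle_ID(record, tag, value.rstrip(";"))
```

The quirky variant then holds only the differences, so a fix to the base parser is picked up automatically instead of having to be copied into quirkyembl.pm by hand. Whether the real rich-format parsers can be cut into subtasks this cleanly is exactly the open design question raised above.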