[Bioperl-l] genpept/swiss

Andrew Dalke dalke@acm.org
Mon, 4 Sep 2000 00:17:48 -0600


Hilmar Lapp <hlapp@gmx.net>:
>Some of you may object to this

I'm one of those objectors.  If the format isn't right in one place,
how certain are you that the recovery is correct and didn't skip
important information?

A goal in Martel, my parser generator, is that it be easy to support
special cases.  For example, sprot38 contains an improperly formatted
record (N33_HUMAN - since fixed).  You can see the modification to the
grammar in http://biopython.org/~dalke/Martel/Martel/formats/swissprot38.py
(look for "bogus").  It's 6 additional lines of code.
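
As a rough sketch of that idea (this is not Martel's API, and the malformed
line shown is invented for illustration, not the actual N33_HUMAN problem),
a strict format definition can be extended with an explicit pattern for one
known-bad case, while anything else still fails loudly:

```python
import re

# Strict pattern for a hypothetical "ID" line: name, then a length in residues.
STRICT_ID = re.compile(r"^ID   (?P<name>\w+) +(?P<length>\d+) AA\.$")

# Release-specific exception: a hypothetical known-bad line in which the
# trailing period is missing. It is matched explicitly, not silently skipped.
BOGUS_ID = re.compile(r"^ID   (?P<name>\w+) +(?P<length>\d+) AA$")


def parse_id_line(line, extra_patterns=()):
    """Parse one ID line, trying the strict pattern first, then known exceptions.

    Raises ValueError for anything that matches none of them, so an
    unexpected format problem can never pass unnoticed.
    """
    for pattern in (STRICT_ID, *extra_patterns):
        match = pattern.match(line)
        if match:
            return match.group("name"), int(match.group("length"))
    raise ValueError(f"unrecognized ID line: {line!r}")
```

A parser for the affected release would call parse_id_line(line, (BOGUS_ID,));
every other release keeps the strict behavior.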

>That is, is it possible to set the reporting
>level such that warn() actually becomes equivalent to throw()?

The other option is to fix the parser to handle those cases properly.
However, I'm looking at the bug report and can't tell what the problem is
with
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=4033427&dopt=GenPept
so maybe I'm not the right one to comment.
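
On the warn()/throw() question itself, here is an illustration in Python
rather than BioPerl (I am not claiming BioPerl has an equivalent switch).
Python's standard warnings module lets the caller promote warnings to
exceptions. The parse_record function is a made-up example parser:

```python
import warnings


def parse_record(text):
    """Hypothetical parser that warns about a line it does not understand."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            # Lenient default: report the problem and keep going.
            warnings.warn(f"skipping unparseable line: {line!r}")
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def parse_record_strict(text):
    """Same parser, but every warning is raised as an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")
        return parse_record(text)
```

With this arrangement the parser code stays the same and the caller decides
whether a format problem is tolerable.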

>I think it is
>pointless for BioPerl to aim at clean and complete conversion from any
>rich format into another rich format for sequences.
>
>The only way this could be achieved with a reasonable effort is by
>mapping languages to a common meta-representation, like XML or ASN.1 (and
>anything the meta-format doesn't cover will still be lost).

I'm missing something here as well.  Perl supports complex data structures
which can model anything that XML or ASN.1 can.  If there is a common
meta-representation using either of those two specifications, then it
should be possible to describe it in perl.  That means that perl, and
hence bioperl, can potentially do a "clean and complete conversion"
up to information missing in a record.
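
To make that concrete, here is a minimal sketch in Python (the same idea
applies to Perl hashes and arrays). It maps an XML element tree to nested
dictionaries and lists and back again. The function names are invented for
this example, and it only covers elements, attributes and text:

```python
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an XML element into nested dicts and lists."""
    return {
        "tag": elem.tag,
        "attrib": dict(elem.attrib),
        "text": elem.text,
        "tail": elem.tail,
        "children": [element_to_dict(child) for child in elem],
    }


def dict_to_element(node):
    """Rebuild an XML element from the structure made by element_to_dict."""
    elem = ET.Element(node["tag"], node["attrib"])
    elem.text = node["text"]
    elem.tail = node["tail"]
    for child in node["children"]:
        elem.append(dict_to_element(child))
    return elem
```

If the round trip dict_to_element(element_to_dict(root)) serializes to the
same bytes as root, nothing that the XML carried was lost in the native
data structure.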

The problem is actually two-fold.  First, it sounds like the data from
SWISS-PROT isn't being converted correctly.  That's something bioperl
(or biopython) might be able to address.  GenPept is a derived database
so with some effort it should be possible to recreate, and even
recreate better.  Or better yet, on demand look up the appropriate
records from the primary data source, instead of depending on the
translations.

Second, there may be semantic differences between the two which are
not intertranslatable.  That indicates a problem in the formats, in
that they can't express everything they need to.
This is related to your point about automated translations between
two languages, but it isn't that complex.  If there's no way a human
can make a better data file (because there's no way to indicate
certain information) then a computer can't be blamed for its problems.

Otherwise, the problem is that people haven't put enough work into
making the programs understand the meanings of the different formats,
which is understandable - there's only so much time to work on
something, and 80% is often good enough.  Is it possible to make
the bioperl code smarter so it knows how to deal with these different
cases (other than just ignoring them)?  Is it possible to use
bioperl to write a better GenPept in the first place?  Would you
like to work on that code?  BTW, down that path lie many meetings
and arguments on what the "right" data structure/output is, so
watch out.  :)

                    Andrew
                    dalke@acm.org