[Bioperl-l] genpept/swiss

hilmar.lapp@pharma.Novartis.com
Mon, 4 Sep 2000 18:15:25 +0100




There's also the famous statement about writing tolerant readers and
strict writers.

     Well, that's why I always have an uneasy feeling when I see the
     regexps requiring exactly x number of spaces at the beginning of
     tag=value lines etc. But that's what the spec says, so ...

     In general, I'm in favour of having readers as tolerant as possible
     (and sensible, of course).
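
     To illustrate the tolerant-reader / strict-writer idea, here is a
     minimal sketch in Python (not BioPerl code). The names INDENT,
     read_qualifier and write_qualifier are hypothetical, and the
     indentation width of 21 is only an assumed value for the example;
     the real number has to come from the format spec.

```python
import re

# Assumed column layout, for illustration only; check the format spec
# before relying on this number.
INDENT = 21

# Strict pattern: exactly INDENT spaces, then /tag=value.
STRICT = re.compile(r"^ {%d}/(\w+)=(.*)$" % INDENT)
# Tolerant pattern: any amount of leading whitespace (spaces or tabs).
TOLERANT = re.compile(r"^\s*/(\w+)=(.*)$")


def read_qualifier(line):
    """Tolerant reader: accept any leading whitespace.

    Returns a (tag, value) pair, or None if the line is not a
    /tag=value line at all.
    """
    match = TOLERANT.match(line.rstrip("\n"))
    if match is None:
        return None
    tag, value = match.groups()
    # Drop surrounding whitespace and optional double quotes.
    return tag, value.strip().strip('"')


def write_qualifier(tag, value):
    """Strict writer: always emit exactly INDENT spaces and a quoted value."""
    return '%s/%s="%s"' % (" " * INDENT, tag, value)
```

     The reader accepts a line indented with three spaces or a tab,
     while the writer only ever produces lines that the strict pattern
     would also accept.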

About 4 years ago there was a statement in Nature by Hooft et al.
(see http://www.cmbi.kun.nl/gv/articles/ref5.html) pointing out that
the PDB contained over a million errors and outliers.  Would
such an analysis and report of the error rates in existing sequence
databases be worthwhile?

     I think I see your point. What do you think of callbacks in this
     respect, as I mentioned in the previous response to Kris?
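
     The earlier response to Kris is not part of this message, so the
     following is only a guess at what a callback-based approach could
     look like, sketched in Python. The function parse_lines, the
     on_error parameter and the simple "TAG value" line layout are all
     hypothetical: the parser hands each malformed line to a
     caller-supplied function, which can log it, repair it, or have it
     skipped.

```python
def parse_lines(lines, on_error=None):
    """Parse 'TAG   value' lines into (tag, value) pairs.

    Malformed lines are passed to on_error(lineno, line). The callback
    may return a (tag, value) pair to recover, or None to skip the
    line. Without a callback, a malformed line raises ValueError.
    """
    records = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.split(None, 1)
        # A well-formed line has an upper-case tag followed by a value.
        if len(parts) == 2 and parts[0].isupper():
            records.append((parts[0], parts[1].strip()))
            continue
        if on_error is None:
            raise ValueError("line %d is malformed: %r" % (lineno, line))
        fixed = on_error(lineno, line)
        if fixed is not None:
            records.append(fixed)
    return records


# Usage: collect problems for a later error report instead of aborting.
problems = []


def remember(lineno, line):
    problems.append((lineno, line))
    return None  # skip the offending line


entries = parse_lines(["ID   X1", "garbage", "DE   test protein"], remember)
```

     With a callback like this, the same parser run could both read
     the data and gather the material for an error-rate report.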

BTW, since the flat files are machine generated, you wouldn't think
there would be all that many problems, would you?  Or that the major

     I certainly wouldn't, and I feel a bit like I'm making a fool of
     myself by trying to deal with the errors other people's writers
     produce, not least because, as for you, this is my free time too (of
     which there isn't that much).

     BTW, some months ago I tested the swissprot and genbank parsers on
     the complete SwissProt and the GenBank primate section, respectively,
     and there wasn't a single entry that was obviously misformatted (of
     course, subtle format errors and senseless dates were not counted as
     such), so the error rate might have decreased a bit since 2 years ago.

I didn't follow what you said, so I don't know.  Part of the problem
may be that I don't know much about how complete the bioperl parsers are.

     Unfortunately, there are other very sad points, for instance some
     types of location (compound locations with cross-references, fuzzy
     locations) cannot be handled because the data model is not yet
     prepared for them. (This means that you e.g. lose the translation tag
     for those sequences, and since the CDS coordinates are not handled
     either, you basically cannot tell the correct translation.)

     Maybe it's a good time to bring up this painful discussion again: What
     do people think about a rewrite of the SeqIO parsers? What should the
     re-design provide for? Given the current maturity of XML
     representations of the major databanks (can anyone comment on this,
     that is, what is the maturity?), does it make sense to go directly for
     an XML mapping?  Do the advantages of such an approach justify the
     price in overhead (performance-wise)? Would it be realistic to limit
     future support (meaning maintenance) in BioPerl to XML dumps provided
     by the major database providers?

     And last but not least: who would be volunteering to do what?

     Have the Ensembl people done some work in this direction that could be
     back-ported?

     I guess Ewan wants to comment on these questions...

          Hilmar