[Bioperl-l] genpept/swiss
hilmar.lapp@pharma.Novartis.com
Mon, 4 Sep 2000 18:15:25 +0100
There is also the famous statement about writing tolerant readers and strict
writers.
Well, that's why I always have an uneasy feeling when I see the
regexps requiring exactly x number of spaces at the beginning of
tag=value lines etc. But that's what the spec says, so ...
In general, I'm in favour of having readers as tolerant as possible
(and sensible, of course).
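
To make the contrast concrete, here is a minimal sketch in Python (not
BioPerl code). It assumes a GenBank-style qualifier line such as
/gene="abc"; the 21-space indent, the pattern names and the helper
parse_qualifier are assumptions chosen for illustration only.

```python
import re

# Strict: exactly 21 leading spaces (an assumed indent), as a literal
# reading of a format spec might demand.
STRICT_QUALIFIER = re.compile(r'^ {21}/(\w+)=(.*)$')

# Tolerant: any amount of leading whitespace before the qualifier.
TOLERANT_QUALIFIER = re.compile(r'^\s*/(\w+)=(.*)$')


def parse_qualifier(line, pattern):
    """Return (tag, value) if the line matches the pattern, else None."""
    m = pattern.match(line)
    if m is None:
        return None
    # Drop surrounding whitespace and the enclosing double quotes.
    return m.group(1), m.group(2).strip().strip('"')


well_formed = " " * 21 + '/gene="abc"'
off_by_one = " " * 20 + '/gene="abc"'   # one space short of the assumed indent
```

With these definitions the strict pattern rejects the off_by_one line,
while the tolerant one still recovers the tag and value from it.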
About 4 years ago there was a statement in Nature by Hooft et al.
(see http://www.cmbi.kun.nl/gv/articles/ref5.html) pointing out that
the PDB contained over a million errors and outliers. Would
such an analysis and report of the error rates in existing sequence
databases be worthwhile?
I think I see your point. What do you think of callbacks in this
respect, as I mentioned in the previous response to Kris?
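
The earlier response to Kris is not reproduced here, so the following
is only one possible reading of the callback idea, sketched in Python
rather than Perl: the reader hands malformed lines to a caller-supplied
function instead of aborting. The function read_records, its on_error
parameter and the TAG=value line format are all hypothetical.

```python
def read_records(lines, on_error=None):
    """Yield (tag, value) pairs from 'TAG=value' lines.

    Malformed lines are passed to the hypothetical on_error callback
    instead of aborting the whole parse; without a callback they raise.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        tag, sep, value = line.partition("=")
        if not sep or not tag.strip():
            if on_error is None:
                raise ValueError(f"line {lineno}: malformed line {line!r}")
            on_error(lineno, line)  # let the caller decide what to do
            continue
        yield tag.strip(), value.strip()


# Usage: collect the problems and keep the well-formed records.
problems = []
records = list(read_records(
    ["ID=P12345", "garbage", "DE=kinase"],
    on_error=lambda lineno, line: problems.append((lineno, line)),
))
```

The point of such a design would be that the reader itself stays
tolerant, while the caller chooses whether to log, repair or reject.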
BTW, since the flat files are machine generated, you wouldn't think
there would be all that many problems, would you? Or that the major
I certainly wouldn't, and I feel a bit like I am making a fool of myself by
trying to deal with the errors other people's writers produce, not
least because, as for you, this is my free time too (of which there isn't
that much).
BTW some months ago I tested the swissprot and genbank parsers on the
complete SwissProt and the GenBank primate section, respectively, and
there wasn't a single entry that was obviously misformatted (of
course, subtle format errors and senseless dates were not counted as
such), so the error rate might have decreased a bit over the last 2 years.
I didn't follow what you said, so I don't know. Part of the problem
may be that I don't know much about how complete the bioperl parsers are.
Unfortunately, there are other very sad points, for instance some
types of location (compound locations with cross-references, fuzzy
locations) cannot be handled because the data model is not yet
prepared for them. (This means that you e.g. lose the translation tag
for those sequences, and since the CDS coordinates are not handled
either, you basically cannot tell the correct translation.)
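
As a rough illustration of what a data model prepared for such
locations might need to hold, here is a small Python sketch (not the
BioPerl model). It assumes feature-table style strings such as
join(<1..50,J00194.1:100..>202); the class names Position, Span and
Location and both parse functions are hypothetical, and complement(),
nested operators and between-base positions are deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Position:
    value: int
    fuzzy: Optional[str] = None   # '<' or '>' if the true endpoint lies beyond value


@dataclass
class Span:
    start: Position
    end: Position
    seq_id: Optional[str] = None  # set when the span refers to another entry


@dataclass
class Location:
    operator: Optional[str]       # 'join', 'order', or None for a single span
    spans: List[Span]


# Optional "ACCESSION:" prefix, optional fuzzy marker, start, optional "..end".
_SPAN = re.compile(r'^(?:([A-Za-z0-9_.]+):)?([<>]?)(\d+)(?:\.\.([<>]?)(\d+))?$')
_COMPOUND = re.compile(r'^(join|order)\((.*)\)$')


def parse_span(text):
    m = _SPAN.match(text.strip())
    if m is None:
        raise ValueError(f"unsupported span: {text!r}")
    seq_id, f1, p1, f2, p2 = m.groups()
    start = Position(int(p1), f1 or None)
    # A single position such as "467" starts and ends at the same base.
    end = Position(int(p2), f2 or None) if p2 is not None else Position(int(p1), f1 or None)
    return Span(start, end, seq_id)


def parse_location(text):
    text = text.strip()
    m = _COMPOUND.match(text)
    if m:
        return Location(m.group(1), [parse_span(p) for p in m.group(2).split(",")])
    return Location(None, [parse_span(text)])
```

For example, parse_location("join(<1..50,J00194.1:100..>202)") yields
two spans, the first with a fuzzy start and the second carrying the
cross-referenced identifier in seq_id.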
Maybe it's a good time to bring up this painful discussion again: What
do people think about a rewrite of the SeqIO parsers? What should the
re-design provide for? Given the current maturity of XML
representations of the major databanks (can anyone comment on this,
that is, what is the maturity?), does it make sense to go directly for
an XML mapping? Do the advantages of such an approach justify the
price in overhead (performance-wise)? Would it be realistic to limit
future support (meaning maintenance) in BioPerl to XML dumps provided
by the major database providers?
And last but not least: who would volunteer to do what?
Have the Ensembl people done some work in this direction that could be
back-ported?
I guess Ewan wants to comment on these questions...
Hilmar