[Bioperl-l] Insanity of Swissprot parsing, take 2.
Ewan Birney
birney@ebi.ac.uk
Thu, 22 Mar 2001 21:30:51 +0000 (GMT)
At Ensembl we are trying to produce swissprot format output for the
swissprot group. An interesting challenge, as swissprot forces a massive
amount of information into generally not-so-great-for-computers text.
Latest issue: DR lines in swissprot have an optional count number for
domains. (Of course, DR lines without this information do not have
anything here.)
<outtake>
In my view swissprot is confounding the feature table with the DR lines
here. The 2 is nowhere near as useful as having the FT lines. However,
the FT lines don't necessarily store this information about a particular
domain database's matches, whereas the DR lines sometimes do. Why? Well -
this is one of the mysteries of swissprot, as to make the 2 they must
have had this feature-table sort of information at some point...
</outtake>
Eg:
DR Pfam; PF00076; rrm; 2.
Currently in bioperl we have a DBLink object with
database (Pfam)
primary_id (PF00076)
optional_id (rrm)
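
To make the mapping concrete, here is a rough sketch of splitting the
example DR line into those three fields plus whatever is left over. It is
written in Python purely for illustration and is not bioperl code; the
function name parse_dr_line, the returned dictionary, and the rule that a
purely numeric fourth field is the count are all assumptions of this
sketch, not anything swissprot or bioperl defines.

```python
def parse_dr_line(line):
    """Split one DR line into DBLink-style fields plus an optional count."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip()
    # Drop the single terminating full stop, keeping any dots inside fields.
    if body.endswith("."):
        body = body[:-1]
    fields = [field.strip() for field in body.split(";")]
    if len(fields) < 2 or not fields[0] or not fields[1]:
        raise ValueError("expected database and primary id: %r" % line)
    count = None
    # Assumption: a purely numeric fourth field is the domain count.
    if len(fields) > 3 and fields[3].isdigit():
        count = int(fields[3])
    return {
        "database": fields[0],
        "primary_id": fields[1],
        "optional_id": fields[2] if len(fields) > 2 else None,
        "count": count,
    }


record = parse_dr_line("DR Pfam; PF00076; rrm; 2.")
print(record)
```

The first three keys line up with the DBLink fields above; the "count"
key is the part that currently has no home in the object.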
Now - what should we do with this "2."? Here are some options:
(a) punt it back to swissprot and claim they should figure out some better
way of representing this information in text files. Let's face it - this
is not going to happen in a hurry if at all, even with the best will of
the people in swissprot
<outtake>
For wolfgang and henning: that said (ie, that we can't change swissprot
format in a hurry), I think it would be good to start lobbying the powers
that be, ie, rolf and amos, for some sort of 5 year plan about this.
I realise this rapidly becomes seriously difficult, and that it is bad
enough lobbying for simpler changes, so I just want to lend my voice
to a general sanity-in-data-representation discussion. I am probably
keenest on a primary clean-data representation with derivative "text
views" representations.
</outtake>
(b) Mung the 2. into the optional_id so the string becomes something like
"rrm; 2."
Yuk. I don't like this at all. Even though optional_id is left
deliberately open for interpretation, I find this pretty repulsive. (ie,
"let's represent swissprot in objects - as a list of Line Objects". Not
a good representation.)
(c) Inherit a special SwissprotDbLink object. Like the
"to_FTHelper" system for features, an optional "to_DRLine" method could be
detected, allowing an object to present itself in the DR line as it
wished (maybe we should have an internal mini-object representation of
this line... hmmm...)
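
A minimal sketch of how option (c) might look, again in Python for
illustration only and not the bioperl implementation. The classes DBLink
and SwissprotDbLink here are stand-ins defined in the sketch itself, the
method to_dr_line mirrors the proposed "to_DRLine", and format_dr_line is
a hypothetical writer-side helper. The point is just that the writer
checks for the optional method and otherwise falls back to the plain
three-field form.

```python
class DBLink:
    """Stand-in for a generic link: database, primary id, optional id."""

    def __init__(self, database, primary_id, optional_id=None):
        self.database = database
        self.primary_id = primary_id
        self.optional_id = optional_id


class SwissprotDbLink(DBLink):
    """Link that also carries the swissprot-specific domain count."""

    def __init__(self, database, primary_id, optional_id=None, count=None):
        super().__init__(database, primary_id, optional_id)
        self.count = count

    def to_dr_line(self):
        # The object decides how it appears in the DR line.
        fields = [self.database, self.primary_id]
        if self.optional_id is not None:
            fields.append(self.optional_id)
        if self.count is not None:
            fields.append(str(self.count))
        return "DR " + "; ".join(fields) + "."


def format_dr_line(link):
    """Use the link's own to_dr_line if it has one, else a default."""
    custom = getattr(link, "to_dr_line", None)
    if callable(custom):
        return custom()
    fields = [link.database, link.primary_id]
    if link.optional_id is not None:
        fields.append(link.optional_id)
    return "DR " + "; ".join(fields) + "."


print(format_dr_line(SwissprotDbLink("Pfam", "PF00076", "rrm", 2)))
print(format_dr_line(DBLink("Pfam", "PF00076", "rrm")))
```

Generic links stay untouched by this, and only the swissprot writer needs
to know that the extra method might exist.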
I prefer (c) but I do wish we didn't have this problem. Hey ho! I guess
this is the wonders of bioinformatics coming back to haunt us.
What do other people think?
e.
-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>.
-----------------------------------------------------------------