[Bioperl-l] swissprot is a pain - partly due to bioperl
Hilmar Lapp
hlapp@gnf.org
Fri, 4 Oct 2002 12:03:29 -0700
I thought I'd take on the smoothest ride first by dumping swissprot (rel. 40) into biosql. This turned out to be painful, and here's why, what I did, and what else we need to do to relieve some of the pain.
I'm writing this line at the end because this became a long email, so I thought I'd better put a summary here.
<summary>
- Species and classification names choke the parser. I have a solution to fix the problem once and for all.
- Swissprot entry to species is an n:n relationship. The parser screwed up the species. I have a short-term fix, but generally speaking this is a total nightmare.
- 'Common name' isn't always a common name, but sometimes a strain or isolate, which is crucial for identifying the species. I propose a solution.
- Virus classification scheme is not handled properly, and I don't know how it should be. Need an expert.
</summary>
Read on to share my pain.
1) Species names do not conform to what we in Bioperl think they should conform to. There are endless variants, with ever new non-letter characters being used even in species names, especially for viruses and bacteria. What's really painful about this is that our name validators throw an exception (Elia, you were so right) and the parser chokes.
I honestly see no point in us trying to keep up with the fancy names of virus and bacteria classifications if in the end we have to trust the sources anyway. So I decided to fix this problem once and for all by doing exactly that: in set mode, $species->classification() now accepts a second calling convention in addition to the traditional array of strings. If the first argument is a reference to an array, the second argument is treated as a flag; if it evaluates to true, no name validation whatsoever is done. I.e., 'trust the caller.' I modified the swissprot parser accordingly, i.e., it now trusts swissprot species and classification names, however weird they may read.
It works for me. Does anyone have a problem with me committing this? I also suggest that we modify the genbank, embl, etc parsers accordingly.
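To make the proposed calling convention concrete, here is a minimal sketch of the idea in Python rather than Perl. It is not the Bioperl code: the class, the `force` parameter name, and the validation pattern are all assumptions made up for illustration. It only shows the 'validate unless the caller says to trust the names' behaviour.

```python
import re

# Hypothetical strict pattern standing in for a name validator:
# letters, spaces, dots and hyphens only.
_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z .\-]*$")


class Species:
    """Toy stand-in for a species object; not the Bio::Species API."""

    def __init__(self):
        self._classification = []

    def classification(self, names=None, force=False):
        """Return the classification; set it first if names is given.

        With force=True the names are stored as-is ('trust the caller').
        Otherwise every name must match the strict pattern.
        """
        if names is not None:
            if not force:
                for name in names:
                    if not _NAME_RE.match(name):
                        raise ValueError(f"illegal name: {name!r}")
            self._classification = list(names)
        return list(self._classification)
```

A parser that trusts its source would always call the setter with the flag set, so an unusual name is stored rather than raising mid-parse.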
2) The swissprot people, in their quest for non-redundancy, apparently collapsed sequences which are the same for multiple species into one entry. This means Species to Entry can be an n:n relationship. This not only caused the bioperl parser to silently (!) screw up the species, it also violates the bioperl object model. I fixed the screw-up such that at least the 'main species' (the one matching the ID division) is correct. At this point I ignore the other species.
One solution to this could be to add get_secondary_species() to Bio::Seq::RichSeqI. What are people's thoughts on this? Does anyone have an objection to me committing my fix?
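As a rough illustration of the 'main species plus secondary species' idea, here is a small Python sketch. Everything in it is hypothetical: the `Entry` class, the `assign_species` helper, the made-up entry name, and the assumption that the caller already knows a division code for each organism listed in the entry. It is not how the Bioperl parser is implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entry:
    """Toy sequence entry: one main species, any number of others."""
    entry_id: str
    species: Optional[str] = None
    secondary_species: List[str] = field(default_factory=list)


def assign_species(entry_id: str, organisms: List[Tuple[str, str]]) -> Entry:
    """Pick the organism whose code matches the ID division as main species.

    organisms is a list of (division_code, organism_name) pairs; how the
    codes are obtained is left open here (an assumption of this sketch).
    If nothing matches, species stays None and all names are secondary.
    """
    division = entry_id.rsplit("_", 1)[-1]
    entry = Entry(entry_id)
    for code, name in organisms:
        if code == division and entry.species is None:
            entry.species = name
        else:
            entry.secondary_species.append(name)
    return entry
```

With this shape the object model still says 'a sequence has one species', while the extra organisms remain available to anyone who asks for them.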
Once we have decided how to keep the other species, we could at least dump entries read from swissprot without losing significant information. But there's more trouble buried here: if you take a gene-centric viewpoint, you get one entry where you should have gotten five, because the gene is present not in one species but in five, no matter how similar the sequence is. (In fact, if you search swissprot through the SRS gateway for YWHAE, you might think only humans have this gene. You have to visit and study the entry to find out that mouse etc. actually have it too. For them, the division will disagree with the species. I guess the swissprot people have good reasons why they do this.) Because of this I strongly vote for not changing the fundamental bioperl object model (a sequence has one species); IMHO normalizing by protein sequence is a bad idea except for similarity searches.
3) Identifiability of a species. As it turns out, the (full) binomial is not enough, because for microorganisms different strains and/or isolates get different NCBI_TaxIDs. Also, in these cases the term in parentheses on the OS line does not indicate a common name (which is supposedly redundant with the binomial in terms of identifiability) but the name of the strain or isolate, and is therefore a key part of the species' name (i.e., the field is semantically overloaded). I propose the following to fix this.
- add an attribute variant() to Bio::Species, holding the un-interpreted value in parentheses if it appears not to be the common name. (e.g. 'isolate Gambia', 'PYSG', or 'strain PSG').
- pass the value in parentheses either to variant() or to common_name(), depending on some magical logic ...
4) Virus classification. This is a whole other nightmare, and I'm not going to delve into it. But it's not handled properly in the bioperl swissprot (and possibly other) parsers. I can help fix the parser and, if necessary, Bio::Species, but I feel this needs input from an expert. Virus researchers and taxonomists out there, this is a call for you to de-lurk and volunteer.
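One possible shape for the 'magical logic', sketched in Python. The heuristics are purely my assumptions for illustration (a 'strain '/'isolate ' prefix, or a term with no lowercase letters such as a bare code), and the function names are hypothetical; real OS lines will certainly need more care, including entries listing several organisms.

```python
# Prefixes that, in this sketch, mark the term as a strain/isolate name.
_VARIANT_PREFIXES = ("strain ", "isolate ")


def split_os(os_value):
    """Split 'Name (term).' into (name, term); term is None if absent."""
    text = os_value.strip().rstrip(".")
    name, sep, rest = text.partition("(")
    term = rest.rsplit(")", 1)[0].strip() if sep else None
    return name.strip(), term


def looks_like_variant(term):
    """Guess whether a parenthesized term is a variant, not a common name."""
    if term.lower().startswith(_VARIANT_PREFIXES):
        return True
    # Bare codes like 'PYSG' contain no lowercase letters at all.
    return not any(ch.islower() for ch in term)


def parse_os(os_value):
    """Route the parenthesized term to 'variant' or 'common_name'."""
    name, term = split_os(os_value)
    result = {"binomial": name, "common_name": None, "variant": None}
    if term:
        key = "variant" if looks_like_variant(term) else "common_name"
        result[key] = term
    return result
```

The point of keeping the variant value un-interpreted is that it can be stored and compared verbatim, whatever the heuristic that routed it there.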
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------