[Bioperl-l] swissprot is a pain - partly due to bioperl

Ewan Birney birney@ebi.ac.uk
Sat, 5 Oct 2002 15:16:53 +0100 (BST)


On Fri, 4 Oct 2002, Hilmar Lapp wrote:

> I thought I'd take on the smoothest ride first by dumping swissprot (rel. 40) into biosql. This turned out to be painful; here's why, what I did, and what else we need to do to relieve some of the pain.
>
> I'm writing this line at the end because it became a long email, so I thought I'd better put a summary here.
>
> <summary>
> - Species and classification names choke the parser. I have a solution to fix the problem once and for all.

great. I approve. I also was doing this.

> - Swissprot entry to species is an n:n relationship. The parser screwed up the species. I have a short-term fix, but generally speaking this is a total nightmare.

It is a long-standing gripe I have with swissprot, and I doubt
they are going to change their spots in a hurry. I consider this to
be "insane" but can't talk swissprot out of it.

> - 'Common name' isn't always a common name, but sometimes a strain or isolate, which is crucial for identifying the species. I propose a solution.
> - Virus classification scheme is not handled properly, and I don't know how it should be. Need an expert.
> </summary>
>
> Read on to share my pain.
>
> 1) Species names do not conform to what we in Bioperl think they should conform to. There are endless variants, with ever new non-letter characters being used even in species names, especially for viruses and bacteria. What's really painful about this is that our name validators throw an exception (Elia, you were so right) and the parser chokes.
>
> I honestly see no point in us trying to keep up with the fancy names of viruses and bacteria classifications, if in the end we have to trust the sources anyway. So, I decided to fix this problem once and for all by doing exactly that: $species->classification(), in addition to the traditional array of strings, will now also accept another calling form in set mode: if the first argument is a reference to an array, the second argument is checked for whether it evaluates to true. If it does, no name validation whatsoever is done, i.e., 'trust the caller.' I modified the swissprot parser accordingly, i.e., it trusts swissprot species and classification names, however weird they may read.
>
> It works for me. Does anyone have a problem with me committing this? I also suggest that we modify the genbank, embl, etc parsers accordingly.
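
For readers who want to see the 'trust the caller' idea spelled out, here is a minimal sketch. It is written in Python rather than Perl, and the class, the `trust_caller` parameter and the validation pattern are all hypothetical stand-ins, not the actual Bio::Species code or the committed fix.

```python
import re

# Hypothetical validator: accepts letters, spaces, dots and hyphens only.
_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z .\-]*$")


class Species:
    """Toy stand-in for a species object; not the Bio::Species API."""

    def __init__(self):
        self._classification = []

    def classification(self, names=None, trust_caller=False):
        """Return the classification; set it first when `names` is given.

        With trust_caller=True the names are stored without any validation.
        """
        if names is not None:
            if not trust_caller:
                for name in names:
                    if not _NAME_RE.match(name):
                        raise ValueError(f"invalid name: {name!r}")
            self._classification = list(names)
        return list(self._classification)
```

A parser that trusts its source would always pass the flag, so an unusual name is stored as-is instead of raising.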
>
> 2) The swissprot people in their quest for non-redundancy apparently
> collapsed sequences which are the same for multiple species into one
> entry. This means Species to Entry can be an n:n relationship. This not
> only caused the bioperl parser to silently (!) screw up the species, it
> also violates the bioperl object model. I fixed the screw-up such that
> at least the 'main species' (the one matching the ID division) is
> correct. At this point I ignore the other species.
>
> One solution to this could be to add get_secondary_species() to
> Bio::Seq::RichSeqI. What are people's thoughts on this? Does anyone have
> an objection to me committing my fix?

Yuk, but probably nice to do.
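
The 'main species plus secondary species' split could be sketched roughly as below. This is Python, not the parser code, and it rests on loud assumptions: the function names are hypothetical, the OS value is assumed to be a comma/'and' separated list, and the main species is guessed by comparing the parenthesised label with the suffix of the entry ID. The entry ID and OS value in the usage note are made up for illustration.

```python
import re


def split_os_line(os_line):
    """Split an OS-style value into (name, parenthetical) pairs."""
    text = os_line.strip().rstrip(".")
    # Separators assumed here: ", ", ", and " or " and ".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", text)
    species = []
    for part in parts:
        m = re.match(r"^(.*?)\s*(?:\((.*)\))?$", part.strip())
        species.append((m.group(1), m.group(2)))
    return species


def pick_main_species(entry_id, species):
    """Return (main, secondary_list).

    Assumption: the main species is the one whose parenthesised label
    equals the ID suffix (e.g. 'Mouse' for an ID ending in _MOUSE).
    Falls back to the first species listed when nothing matches.
    """
    division = entry_id.rsplit("_", 1)[-1].upper()
    for item in species:
        label = item[1]
        if label and label.upper() == division:
            return item, [s for s in species if s != item]
    return species[0], species[1:]
```

For a made-up ID `EXMP_MOUSE` with the OS value `Homo sapiens (Human), and Mus musculus (Mouse).`, this picks Mus musculus as the main species and leaves Homo sapiens as a secondary one, which is what something like get_secondary_species() would then hand back.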

>
> Once we decided how to keep the other species, we could at least dump
> swissprot read from swissprot without losing significant information.
> But there's more trouble buried here: if you take a gene-centric
> viewpoint, you get one entry where you should have gotten 5, because the
> gene is present not in one species but in five, no matter how similar
> the sequence is. (In fact, if you search swissprot through the SRS
> gateway for YWHAE, you might think only humans have this gene. You have
> to visit and study the entry to find out that actually mouse etc have it
> too. For them, the division will disagree with the species. I guess the
> swissprot people have good reasons why they do this.) Because of this I
> strongly vote for not changing the fundamental bioperl object model (a
> sequence has one species); IMHO normalizing by protein sequence is a bad
> idea except for similarity searches.

Yes. Yes.

>
> 3) Identifiability of a species. (Full) Binomial is not enough as it
> turns out, as for microorganisms different strains and/or isolates get
> different NCBI_TaxIDs. Also, the term in parentheses on the OS line in
> these cases does not indicate a common name (which is supposedly
> redundant with the binomial in terms of identifiability), but the name
> of the strain or isolate, and is therefore a key part of the
> species' name (i.e., it's semantically overloaded). I propose the
> following to fix this.
>
> 	- add an attribute variant() to Bio::Species, holding the
> un-interpreted value in parentheses if it appears not to be the common
> name. (e.g. 'isolate Gambia', 'PYSG', or 'strain PSG').
>
> 	- pass the value in parentheses either to variant() or to
> common_name(), depending on some magical logic ...


Sounds good.
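
The 'magical logic' is left open above, so the following is only one guess at what it might look like, in Python with hypothetical names. The heuristic is an assumption: a value starting with 'strain' or 'isolate', or one that looks like an all-caps code, is treated as a variant, and anything else as a common name. It will certainly misjudge some real OS lines.

```python
import re

# Assumed markers of a strain/isolate rather than a common name.
_VARIANT_PREFIX = re.compile(r"^(strain|isolate)\b", re.IGNORECASE)
# Assumed shape of a bare code such as 'PYSG': caps, digits, few symbols.
_CODE_LIKE = re.compile(r"^[A-Z0-9][A-Z0-9 /.\-]*$")


def classify_parenthetical(value):
    """Return ('variant', value) or ('common_name', value)."""
    value = value.strip()
    if _VARIANT_PREFIX.match(value) or _CODE_LIKE.match(value):
        return ("variant", value)
    return ("common_name", value)
```

With this guess, the three examples quoted above ('isolate Gambia', 'PYSG', 'strain PSG') all land in variant, while a value such as 'Human' lands in common_name.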

>
> 4) Virus classification. This is a whole other nightmare, and I'm not
> going to delve into it. But it's not handled properly in the bioperl
> swissprot (and possibly other) parsers. I can help fix the parser and,
> if necessary Bio::Species, but I feel this needs input from an expert.
> Virus researchers and taxonomists out there, this is a call for you to
> de-lurk and volunteer.

;).


>
> 	-hilmar
> --
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------