[BioSQL-l] Special cases of protein data

Wed Aug 24 06:24:33 EDT 2005

I bet all the sequences you found that have multiple species assigned 
are from Swiss-Prot. At least the example is. Note that multiple species 
per entry is a pathological case artificially created by Swiss-Prot in 
its attempt to normalize (collapse) entries by sequence. This creates a 
number of problems, some amusing, some just plain annoying. A simple 
example is the question of why S. flexneri would have a gene named 
YFDV_ECOLI 
(http://www.pir.uniprot.org/cgi-bin/upEntry?id=YFDV_ECOLI). Such naming 
questions are probably not very controversial for bacteria, but for 
eukaryotes they can lead to bizarre situations. It also precipitates 
some nasty and rather arcane conventions for the GN lines etc. This has 
been discussed several times over the past several years on the BioPerl 
mailing list;
http://portal.open-bio.org/pipermail/bioperl-l/2002-October/009687.html 
is one example of such a thread.

At any rate, UniProt supposedly did away with this, but apparently not 
completely for bacteria. At least for eukaryotic proteins, sequences 
are now duplicated in UniProt for each species that has the gene 
(protein), even if the protein sequence is exactly the same (e.g., 
http://www.pir.uniprot.org/cgi-bin/upEntry?id=CALM_HUMAN and 
http://www.pir.uniprot.org/cgi-bin/upEntry?id=CALM_MOUSE). UniRef100 
will obviously be non-redundant like before (e.g., 
http://www.pir.uniprot.org/cgi-bin/upEntry?id=UniRef100_P62158), but 
BioSQL isn't meant to be your non-redundant BLAST database.

Bottom line: allowing multiple taxa for a single bioentry complicates 
matters a lot for many use cases, is not supported by, e.g., BioPerl 
anyway, and is pathological in all cases except truly chimeric 
sequences. To be honest, I'm not in favor of accommodating pathological 
data models in BioSQL ...
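
To make the trade-off concrete, here is a minimal sketch (Python with an 
in-memory SQLite database, not the real BioSQL DDL) of the two 
alternatives: a join table as proposed in the quoted message, versus one 
bioentry per species as UniProt now does for eukaryotes. The tables are 
heavily simplified assumptions, taxon_bioentry is the hypothetical table 
from the proposal and not part of BioSQL, and the accessions and species 
names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Simplified single-taxon model: a bioentry points to at most one taxon.
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    taxon_id    INTEGER REFERENCES taxon(taxon_id)
);

-- Hypothetical join table from the proposal (NOT part of BioSQL).
CREATE TABLE taxon_bioentry (
    bioentry_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    taxon_id    INTEGER NOT NULL REFERENCES taxon(taxon_id),
    PRIMARY KEY (bioentry_id, taxon_id)
);
""")

# Placeholder taxa, not real species data.
conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                 [(1, "species A"), (2, "species B")])

# Alternative 1: a single entry linked to two taxa through the join table.
conn.execute("INSERT INTO bioentry VALUES (1, 'ENTRY_MULTI', NULL)")
conn.executemany("INSERT INTO taxon_bioentry VALUES (?, ?)", [(1, 1), (1, 2)])

# Alternative 2: the same sequence stored once per species.
conn.execute("INSERT INTO bioentry VALUES (2, 'ENTRY_A', 1)")
conn.execute("INSERT INTO bioentry VALUES (3, 'ENTRY_B', 2)")


def taxa_via_join(accession):
    """Return all taxon names linked to an entry through taxon_bioentry."""
    rows = conn.execute(
        "SELECT t.name FROM bioentry b "
        "JOIN taxon_bioentry tb ON tb.bioentry_id = b.bioentry_id "
        "JOIN taxon t ON t.taxon_id = tb.taxon_id "
        "WHERE b.accession = ? ORDER BY t.name",
        (accession,),
    ).fetchall()
    return [r[0] for r in rows]


def taxon_via_fk(accession):
    """Return the single taxon name referenced by bioentry.taxon_id, or None."""
    row = conn.execute(
        "SELECT t.name FROM bioentry b "
        "JOIN taxon t ON t.taxon_id = b.taxon_id "
        "WHERE b.accession = ?",
        (accession,),
    ).fetchone()
    return row[0] if row else None


print(taxa_via_join("ENTRY_MULTI"))
print(taxon_via_fk("ENTRY_A"))
```

With the join table, every consumer that asks "which species is this 
entry from?" has to cope with a list instead of a single value, which is 
the complication I mean.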

	-hilmar

On Aug 24, 2005, at 1:11 AM, Andreas Dräger wrote:

> Dear BioSQL-developers,
>
> I am currently working with BioSQL on MySQL. I tried to insert a large
> number of protein records downloaded from the NCBI web site in GenPept
> format. During the insertion (performed by BioJava) I got some error
> messages. Looking at the sequences in detail showed that more than 1000
> protein sequences have at least two "source" entries in their "FEATURES"
> table. One such example is the NCBI record with accession number P76519,
> which has as many as four "source" tags. In my opinion this means that
> each of the four given species contains exactly this protein. It would
> follow that at least the thousand or so proteins I found at NCBI belong
> to more than one species. The current BioSQL schema cannot represent
> this case, because each bioentry references at most one taxon (a
> one-to-many relationship from taxon to bioentry). To express that the
> same protein belongs to n taxa, we would need another table that models
> a many-to-many relationship between taxon and bioentry, and the foreign
> key constraint from bioentry to taxon would have to be removed. The
> result would be something like:
>
> bioentry <--> taxon_bioentry <--> taxon
>
> where taxon_bioentry is the extra table. This is just what I was
> thinking about. However, at the moment I cannot insert records like
> P76519 into the BioSQL database. Or am I wrong, and more than one
> "source" tag means something different?
> I am looking forward to any suggestions.
>
> Yours Andreas Dräger
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------