[BioSQL-l] Special cases of protein data
Hilmar Lapp
hlapp at gmx.net
Wed Aug 24 06:24:33 EDT 2005
I bet all the sequences you found with multiple species assigned are
from Swissprot. At least the example is. Note that multiple species per
entry is a pathological case, artificially created by Swissprot in its
attempt to normalize (collapse) entries by sequence. This creates a
number of problems, sometimes amusing, sometimes just plain annoying.
A simple one: why would S. flexneri have a gene named YFDV_ECOLI
(http://www.pir.uniprot.org/cgi-bin/upEntry?id=YFDV_ECOLI)? Naming
questions like this are, I guess, not very controversial for Bacteria,
but for eukaryotes they can lead to bizarre situations. It also
precipitates some nasty and rather arcane conventions for the GN lines
etc. This has been discussed several times on the bioperl mailing list
over the past several years;
http://portal.open-bio.org/pipermail/bioperl-l/2002-October/009687.html
is one example of a thread.

At any rate, UniProt supposedly did away with this, but apparently not
completely for Bacteria? At least for eukaryotic proteins, sequences
are now duplicated in UniProt for each species that has the gene
(protein), even if the protein sequence is exactly the same (e.g.,
http://www.pir.uniprot.org/cgi-bin/upEntry?id=CALM_HUMAN and
http://www.pir.uniprot.org/cgi-bin/upEntry?id=CALM_MOUSE). UniRef100
will obviously be non-redundant like before (e.g.,
http://www.pir.uniprot.org/cgi-bin/upEntry?id=UniRef100_P62158), but
BioSQL isn't meant to be your non-redundant Blast database.
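
To make the difference concrete, here is a rough sketch of the two ways
of keying entries. The sequence strings and the third entry name are
made-up placeholders, not real UniProt data, and the function names are
invented for illustration only.

```python
from collections import defaultdict

# Placeholder records: (entry name, species, sequence).
# The sequences are fake strings and XYZ_HUMAN is a hypothetical entry.
RECORDS = [
    ("CALM_HUMAN", "Homo sapiens", "MAAAK"),
    ("CALM_MOUSE", "Mus musculus", "MAAAK"),
    ("XYZ_HUMAN", "Homo sapiens", "MCCCK"),
]


def collapse_by_sequence(records):
    """Group by sequence: one group can end up with several species."""
    grouped = defaultdict(list)
    for _name, species, seq in records:
        grouped[seq].append(species)
    return dict(grouped)


def key_by_entry(records):
    """Keep one entry per name: every entry has exactly one species."""
    return {name: species for name, species, _seq in records}


if __name__ == "__main__":
    print(collapse_by_sequence(RECORDS))
    print(key_by_entry(RECORDS))
```

Collapsing by sequence is what yields a single record carrying two
species; keying by entry keeps the one-species-per-entry shape at the
cost of storing the identical sequence twice.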

Bottom line: multiple taxa for a single bioentry complicate matters a
lot for many use cases, are not supported by, e.g., bioperl anyway, and
are pathological in all cases except truly chimeric sequences. To be
honest, I'm not in favor of accommodating pathological data models in
BioSQL ...
-hilmar
On Aug 24, 2005, at 1:11 AM, Andreas Dräger wrote:
> Dear BioSQL-developers,
>
> I am currently working with BioSQL using MySQL. I tried to insert a lot
> of protein data that I downloaded from the NCBI web page in GenPept
> format. During the insertion process (performed by BioJava) I got some
> error messages. Looking at the sequences in detail showed that more than
> 1000 protein sequences had at least two "source" entries in their
> "FEATURES" table. One of these bad examples is available at NCBI under
> the accession number P76519. This one even has four "source" tags. In my
> opinion this means that each of the four given species contains exactly
> this protein. This would mean that at least these one thousand proteins
> that I found at NCBI belong to more than one species. This case cannot
> be represented in the current BioSQL schema because there is a
> one-to-many relationship between the tables taxon and bioentry (each
> bioentry can reference only one taxon). To record that the same protein
> belongs to n taxa we would need to create another table to reflect a
> many-to-many relationship between the tables taxon and bioentry. The
> foreign key constraint from bioentry to taxon would have to be removed.
> The result would be something like:
>
> bioentry <--> taxon_bioentry <--> taxon
>
> where taxon_bioentry is the extra table. This is just what I was
> thinking about. However, at the moment I cannot insert files like P76519
> into the BioSQL database. Or am I wrong, and the meaning of more than
> one "source" tag is somehow different?
> I am looking forward to any suggestions.
>
> Yours Andreas Dräger
>
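
For reference, the many-to-many layout proposed in the quoted message
could be sketched roughly as below. This is a toy illustration using
SQLite from Python, not the actual BioSQL DDL: only the table names
bioentry, taxon and taxon_bioentry come from the message, while the
columns, the helper functions and the taxon labels are assumptions.

```python
import sqlite3


def build_demo_db():
    """Create simplified stand-ins for bioentry, taxon and a link table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE taxon (
            taxon_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL
        );
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            accession   TEXT NOT NULL UNIQUE
        );
        -- The extra table: one row per (bioentry, taxon) pair.
        CREATE TABLE taxon_bioentry (
            bioentry_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
            taxon_id    INTEGER NOT NULL REFERENCES taxon(taxon_id),
            PRIMARY KEY (bioentry_id, taxon_id)
        );
        """
    )
    return conn


def add_entry(conn, accession, taxon_names):
    """Insert one bioentry and link it to each of the given taxa."""
    bioentry_id = conn.execute(
        "INSERT INTO bioentry (accession) VALUES (?)", (accession,)
    ).lastrowid
    for name in taxon_names:
        taxon_id = conn.execute(
            "INSERT INTO taxon (name) VALUES (?)", (name,)
        ).lastrowid
        conn.execute(
            "INSERT INTO taxon_bioentry (bioentry_id, taxon_id) VALUES (?, ?)",
            (bioentry_id, taxon_id),
        )


def taxa_for(conn, accession):
    """Return the names of all taxa linked to an accession, sorted."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM taxon t
        JOIN taxon_bioentry tb ON tb.taxon_id = t.taxon_id
        JOIN bioentry b ON b.bioentry_id = tb.bioentry_id
        WHERE b.accession = ?
        ORDER BY t.name
        """,
        (accession,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    # Placeholder taxon labels; the real species of P76519 are not listed here.
    add_entry(conn, "P76519", ["taxon A", "taxon B"])
    print(taxa_for(conn, "P76519"))
```

Note the price of this layout: any query or object mapping that assumes
exactly one taxon per bioentry now has to go through the link table and
cope with several rows, which is the complication the reply above
objects to.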
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------