[Biojava-l] RE: [BioSQL-l] Special cases of protein data

Richard HOLLAND hollandr at gis.a-star.edu.sg
Wed Aug 24 04:45:02 EDT 2005

I've come across this same problem. 

The source features only relate to the location they specify. The sequence itself is always defined as coming from a single organism, further up in the headers of the file under the SOURCE/ORGANISM pairing. That organism is the one that should be referenced from bioentry.

However, that does not help us much in BioSQL. The SOURCE/ORGANISM field describes the organism only as free text. It doesn't provide an NCBI Taxon ID. So we can't auto-generate missing organisms in the NCBI taxon table, and we can't use this field to determine the species of the organism (unless we can guarantee that the whole of the NCBI taxonomy tree has been preloaded into the database, so that the name can be looked up).

The new BioJava GenBank parser we are working on (to be announced soon) uses the taxon ID from the first /db_xref="taxon:..." tag of the first feature as the source organism. It assigns the organism name from the SOURCE/ORGANISM headings to this taxon ID, and emits warnings if it finds other taxon IDs further down. It would be simple enough to change this to depend on a preloaded taxonomy database, but I hate introducing dependencies like that.
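To make that rule concrete, here is a minimal sketch in Python (not the BioJava code itself). The function name pick_source_taxon, the regular expressions, and the sample record are all made up for illustration. As a simplification, it takes the first taxon cross-reference anywhere in the feature table rather than strictly the first feature's.

```python
import re
import warnings

# Feature-table qualifier of the form /db_xref="taxon:562"
TAXON_XREF = re.compile(r'/db_xref="taxon:(\d+)"')
# ORGANISM header line, indented under SOURCE
ORGANISM = re.compile(r'^ +ORGANISM +(.+)$', re.MULTILINE)


def pick_source_taxon(record_text):
    """Return (taxon_id, organism_name) for one flat-file record.

    Hypothetical helper: the first taxon ID found in the feature table
    wins; any different taxon ID further down only triggers a warning.
    """
    organism = ORGANISM.search(record_text)
    name = organism.group(1).strip() if organism else None

    # Only look at qualifiers that appear after the FEATURES header.
    start = record_text.find("FEATURES")
    feature_text = record_text[start:] if start >= 0 else ""

    taxon_ids = [int(t) for t in TAXON_XREF.findall(feature_text)]
    if not taxon_ids:
        return None, name

    first = taxon_ids[0]
    for other in taxon_ids[1:]:
        if other != first:
            warnings.warn(
                f"ignoring additional taxon ID {other}; using {first}"
            )
    return first, name


# Illustrative record with two source features (not a real NCBI entry).
SAMPLE = """LOCUS       EXAMPLE
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
FEATURES             Location/Qualifiers
     source          1..10
                     /organism="Escherichia coli"
                     /db_xref="taxon:562"
     source          1..10
                     /db_xref="taxon:83333"
"""

if __name__ == "__main__":
    print(pick_source_taxon(SAMPLE))
```

The point of the design is that the record still loads with exactly one taxon, and the extra sources are reported rather than silently dropped.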

Would such a required dependency be justified for the sake of correct parsing of multiple sources?


Richard Holland
Bioinformatics Specialist
GIS extension 8199

> -----Original Message-----
> From: biosql-l-bounces at portal.open-bio.org 
> [mailto:biosql-l-bounces at portal.open-bio.org] On Behalf Of 
> "Andreas Dräger"
> Sent: Wednesday, August 24, 2005 4:11 PM
> To: biosql-l at open-bio.org
> Subject: [BioSQL-l] Special cases of protein data
> Dear BioSQL-developers,
> I am currently working with BioSQL using MySQL. I tried to insert a lot
> of protein data that was downloaded from the NCBI web page in GenPept
> format. During the insertion process (performed by BioJava) I got some
> error messages. Looking at the sequences in detail showed that more
> than 1000 protein sequences had at least two "source" entries in their
> "FEATURE" table. One of these bad examples is given at NCBI by the
> accession number P76519. This one even has four "source" tags. In my
> opinion this means that each of the four given species contains exactly
> this protein. This would mean that at least these one thousand proteins
> that I found at NCBI belong to more than one species. This case cannot
> be handled by the current BioSQL schema because there is a one-to-many
> relationship between the tables bioentry and taxon. To represent that
> the same protein belongs to n taxa, we would need to create another
> table to reflect a many-to-many relationship between the tables taxon
> and bioentry. The foreign key constraint of bioentry to taxon would
> have to be removed. The result would be something like:
> bioentry <--> taxon_bioentry <--> taxon
> where taxon_bioentry is the extra table. This is just what I was
> thinking about. However, at the moment I cannot insert files like
> P76519 into the BioSQL database. Or am I wrong and the meaning of more
> than one "source" tag is somehow different?
> I look forward to any suggestions.
> Yours Andreas Dräger
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
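
For reference, the link table proposed in the quoted message could be sketched roughly as follows, using an in-memory SQLite database from Python. This is an illustration only: the column names are simplified assumptions rather than the real BioSQL schema, and the accession and taxon names are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Simplified, hypothetical tables: bioentry no longer points at taxon
# directly; taxon_bioentry carries the many-to-many relationship.
conn.executescript("""
CREATE TABLE taxon (
    taxon_id INTEGER PRIMARY KEY,
    name     TEXT
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT UNIQUE
);
CREATE TABLE taxon_bioentry (
    bioentry_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    taxon_id    INTEGER NOT NULL REFERENCES taxon(taxon_id),
    PRIMARY KEY (bioentry_id, taxon_id)
);
""")

# Placeholder data: one entry linked to two taxa.
conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                 [(1, "taxon A"), (2, "taxon B")])
conn.execute("INSERT INTO bioentry VALUES (1, 'EXAMPLE1')")
conn.executemany("INSERT INTO taxon_bioentry VALUES (?, ?)",
                 [(1, 1), (1, 2)])

# Count how many taxa each bioentry is linked to.
rows = conn.execute("""
    SELECT b.accession, COUNT(tb.taxon_id)
    FROM bioentry b
    JOIN taxon_bioentry tb ON tb.bioentry_id = b.bioentry_id
    GROUP BY b.accession
""").fetchall()
print(rows)
```

The composite primary key on taxon_bioentry keeps the same entry from being linked to the same taxon twice.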
