[Bioperl-l] Re: GO dbxrefs in swissprot

Tue Jul 6 15:17:10 EDT 2004

On Jul 6, 2004, at 6:14 AM, Andreas Henschel wrote:

> Sorry for having bothered you with versioning, I simply trusted the 
> biosql installation instructions that claimed a patched 1.2.1 would 
> do.

Sorry - the documentation needs to be updated.

> What still puzzles me is the size of the database: starting with a 
> 543MB flatfile, the first run (with the faulty parser) gave me 600MB 
> database and 9100 GO annotations. After the rerun with 
> load_seqdatabase (...) --lookup --remove  I get 1.1GB database but 
> only 5100 GO annotations in the dbxref table. Is this due to the 
> normalization?

I'm confused. Did you start with a scratch biosql instance, or did you 
re-use the one loaded with swissprot before?

If re-loading an existing one, the number of rows in dbxref should 
*not* go down, regardless of what you do to bioentries. The number of 
rows in the association table bioentry_dbxref will be affected though.
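
To illustrate the point about the two tables, here is a minimal sketch in Python with an in-memory SQLite database. The three tables are a heavily simplified stand-in for the biosql schema, the cascading foreign key on bioentry_dbxref is an assumption about how the association rows get cleaned up, and the sample rows are made up.

```python
import sqlite3

# Simplified stand-in for three biosql tables (NOT the full schema).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT, accession TEXT);
CREATE TABLE bioentry_dbxref (
    bioentry_id INTEGER NOT NULL REFERENCES bioentry(bioentry_id) ON DELETE CASCADE,
    dbxref_id   INTEGER NOT NULL REFERENCES dbxref(dbxref_id) ON DELETE CASCADE,
    PRIMARY KEY (bioentry_id, dbxref_id)
);
-- Made-up sample rows: two entries, two GO dbxrefs, three associations.
INSERT INTO bioentry VALUES (1, 'P00001'), (2, 'P00002');
INSERT INTO dbxref VALUES (1, 'GO', 'GO:0005524'), (2, 'GO', 'GO:0005737');
INSERT INTO bioentry_dbxref VALUES (1, 1), (1, 2), (2, 2);
""")


def count(table):
    """Return the number of rows in one of the tables created above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


before = (count("dbxref"), count("bioentry_dbxref"))

# Removing a bioentry drops its association rows but leaves dbxref alone.
conn.execute("DELETE FROM bioentry WHERE bioentry_id = 1")

after = (count("dbxref"), count("bioentry_dbxref"))
print(before, after)
```

In this sketch the dbxref count is the same before and after the delete, while bioentry_dbxref shrinks.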

Did you grep for the GO dbxrefs in the swissprot file and then run the 
result through sort and uniq? How many did you get? You should have at 
least as many rows in dbxref.
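
If grep is not at hand, the same count can be sketched in Python. This assumes the GO cross-references sit on DR lines of the form "DR   GO; GO:nnnnnnn; ...", which is how Swiss-Prot flatfiles write them; the function name and the sample lines are made up for illustration.

```python
import re

# Matches a Swiss-Prot cross-reference line pointing to GO, e.g.
# "DR   GO; GO:0005524; F:ATP binding; IEA."
GO_DR = re.compile(r"^DR\s+GO;\s*(GO:\d+)")


def unique_go_accessions(lines):
    """Return the set of distinct GO accessions found on DR lines."""
    found = set()
    for line in lines:
        match = GO_DR.match(line)
        if match:
            found.add(match.group(1))
    return found


# Made-up sample: two entries that share one GO term.
sample = [
    "ID   EXAMPLE1_HUMAN  STANDARD;      PRT;   100 AA.",
    "DR   EMBL; X00000; CAA00000.1; -.",
    "DR   GO; GO:0005524; F:ATP binding; IEA.",
    "DR   GO; GO:0005737; C:cytoplasm; IEA.",
    "//",
    "ID   EXAMPLE2_HUMAN  STANDARD;      PRT;   200 AA.",
    "DR   GO; GO:0005524; F:ATP binding; IEA.",
    "//",
]
go_terms = unique_go_accessions(sample)
print(len(go_terms))

# For a real file, pass the open file handle instead of the sample list:
#     with open("sprot.dat") as fh:   # hypothetical file name
#         print(len(unique_go_accessions(fh)))
```

The number printed is the one to compare against the count of GO rows in dbxref. Whether the database stores the accession with or without the "GO:" prefix is worth checking before comparing the values themselves.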

If you find a discrepancy, i.e., if you can identify a GO dbxref that's 
present in your swissprot file but not in the database, check out an 
entry that is (or should be) associated with that dbxref.
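
To find such an entry in the flatfile, something along these lines could work. It is a sketch that assumes the usual Swiss-Prot layout (an ID line opens each entry, "//" closes it); the function name and the sample data are made up.

```python
def entries_with_go(lines, accession):
    """Return the IDs of flatfile entries whose DR lines cite the GO accession."""
    hits = []
    current_id = None
    for line in lines:
        if line.startswith("ID "):
            # The entry name is the second whitespace-separated token.
            parts = line.split()
            current_id = parts[1] if len(parts) > 1 else None
        elif line.startswith("DR ") and current_id is not None:
            # "DR   GO; GO:0005524; F:ATP binding; IEA." -> ["GO", "GO:0005524", ...]
            body = line[2:].strip().rstrip(".")
            fields = [field.strip() for field in body.split(";")]
            if len(fields) >= 2 and fields[0] == "GO" and fields[1] == accession:
                if current_id not in hits:
                    hits.append(current_id)
        elif line.startswith("//"):
            current_id = None
    return hits


# Made-up sample entries.
flatfile = [
    "ID   EXAMPLE1_HUMAN  STANDARD;      PRT;   100 AA.",
    "DR   GO; GO:0005524; F:ATP binding; IEA.",
    "//",
    "ID   EXAMPLE2_HUMAN  STANDARD;      PRT;   200 AA.",
    "DR   GO; GO:0005737; C:cytoplasm; IEA.",
    "//",
]
print(entries_with_go(flatfile, "GO:0005737"))
```

The IDs it returns are the bioentries to look up in the database and to inspect for the missing association.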

> Is there a full list of parseable databases (GenBank, EMBL, ENSEMBL?, 
> PDB? etc) and the resp. place to download?

This list is more or less identical to the list of formats readable 
by the Bio::SeqIO system in bioperl, because that is what 
load_seqdatabase.pl uses for parsing files. GenBank and EMBL are among 
those formats. Ensembl used to come in an EMBL-formatted flatfile dump, 
but I don't know whether it still does.

Note that without any post-processing the bioentries resulting from a 
file upload will represent the entries found in the source file. E.g., 
if the source file contains an annotated whole chromosome entry, that's 
what you'll get (but not necessarily want) in biosql as well. As an 
example of integrated post-processing, I used to use a 
Bio::Factory::SequenceProcessorI implementation to split Ensembl whole 
chromosomes into predicted genes, transcripts, and proteins, which 
would then get loaded into biosql. Check out the documentation for the 
--pipeline option in load_seqdatabase.pl for how to make the script 
invoke a given post-processor.

	-hilmar
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------