[BioSQL-l] Re: GO dbxrefs in swissprot
Hilmar Lapp
hlapp at gnf.org
Tue Jul 6 15:17:10 EDT 2004
On Jul 6, 2004, at 6:14 AM, Andreas Henschel wrote:
> Sorry for having bothered you with versioning, I simply trusted the
> biosql installation instructions that claimed a patched 1.2.1 would
> do.
Sorry - the documentation needs to be updated.
> What still puzzles me is the size of the database: starting with a
> 543MB flatfile, the first run (with the faulty parser) gave me 600MB
> database and 9100 GO annotations. After the rerun with
> load_seqdatabase (...) --lookup --remove I get 1.1GB database but
> only 5100 GO annotations in the dbxref table. Is this due to the
> normalization?
I'm confused. Did you start with a scratch biosql instance, or did you
re-use the one loaded with swissprot before?
If re-loading an existing one, the number of rows in dbxref should
*not* go down, regardless of what you do to bioentries. The number of
rows in the association table bioentry_dbxref will be affected though.
Did you do a grep on the GO dbxrefs in the swissprot files followed by
sort unique? How many did you get? You should have at least as many
rows in dbxref.
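
For reference, the grep-plus-sort-unique check could look roughly like the sketch below. It assumes the GO cross-references sit on Swiss-Prot DR lines of the form "DR   GO; GO:0005524; ...". The function name and the sample records are made up for illustration, and the exact accession string stored in dbxref may differ (e.g., with or without the "GO:" prefix), so compare formats before comparing counts.

```python
import re

# Assumed layout of a GO cross-reference line in a Swiss-Prot flatfile:
#   DR   GO; GO:0005524; F:ATP binding; IEA.
GO_DR = re.compile(r"^DR\s+GO;\s*(GO:\d+)")

def unique_go_accessions(lines):
    """Return the set of distinct GO accessions found on DR lines."""
    found = set()
    for line in lines:
        m = GO_DR.match(line)
        if m:
            found.add(m.group(1))
    return found

# Made-up sample records; in practice iterate over the open flatfile instead.
sample = """\
ID   TEST1_HUMAN   STANDARD;   PRT;   100 AA.
DR   GO; GO:0005524; F:ATP binding; IEA.
DR   GO; GO:0005634; C:nucleus; IEA.
//
ID   TEST2_HUMAN   STANDARD;   PRT;   200 AA.
DR   GO; GO:0005524; F:ATP binding; IEA.
DR   EMBL; X00000; CAA00000.1; -.
//
""".splitlines()

go_ids = unique_go_accessions(sample)
print(len(go_ids))  # number of distinct GO dbxrefs in the sample
```

The resulting count is the lower bound to expect for GO rows in dbxref.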
If you find a discrepancy, i.e., if you can identify a GO dbxref that's
present in your swissprot file but not in the database, check out an
entry that is (or should be) associated with that dbxref.
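
To find such an entry in the flatfile, something along these lines might help. Again this is only a sketch: it assumes each record starts with an "ID" line and ends with "//", and the function name and sample data are hypothetical.

```python
def entries_with_dbxref(lines, accession):
    """Return the entry names (from ID lines) whose DR lines cite `accession`."""
    ids = []
    current = None
    for line in lines:
        if line.startswith("ID   "):
            # The second whitespace-separated token on the ID line is the entry name.
            current = line.split()[1]
        elif line.startswith("DR   GO;") and current is not None:
            # Fields after the line code are separated by semicolons.
            fields = [f.strip() for f in line[5:].split(";")]
            if len(fields) > 1 and fields[1] == accession and current not in ids:
                ids.append(current)
        elif line.startswith("//"):
            # End of record.
            current = None
    return ids

# Made-up sample records; in practice iterate over the open flatfile instead.
sample = """\
ID   TEST1_HUMAN   STANDARD;   PRT;   100 AA.
DR   GO; GO:0005524; F:ATP binding; IEA.
DR   GO; GO:0005634; C:nucleus; IEA.
//
ID   TEST2_HUMAN   STANDARD;   PRT;   200 AA.
DR   GO; GO:0005524; F:ATP binding; IEA.
//
""".splitlines()

print(entries_with_dbxref(sample, "GO:0005634"))
```

The entry names returned are the ones to look up in the bioentry table and then follow through bioentry_dbxref.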
> Is there a full list of parseable databases (GenBank, EMBL, ENSEMBL?,
> PDB? etc) and the resp. place to download?
This list is more or less identical with the list of formats readable
by the Bio::SeqIO system in bioperl, because this is what
load_seqdatabase.pl uses for parsing files. GenBank and EMBL are among
those formats. Ensembl used to come in an EMBL-formatted flatfile dump,
but I don't know whether it still does.
Note that without any post-processing the bioentries resulting from a
file upload will represent the entries found in the source file. E.g.,
if the source file contains an annotated whole chromosome entry, that's
what you'll get (but not necessarily want) in biosql as well. As an
example of integrated post-processing, I used to use a
Bio::Factory::SequenceProcessorI implementation to split Ensembl whole
chromosomes into predicted genes, transcripts, and proteins, which
would then get loaded into biosql. (Check out the documentation for the
--pipeline option in load_seqdatabase.pl for how to make the script
invoke a given post-processor.)
-hilmar
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------