[Bioperl-l] Bioperl-db doesn't seem to load all entries

Hilmar Lapp hlapp at gmx.net
Thu Dec 7 03:20:14 UTC 2006


I seriously doubt that load_seqdatabase.pl would have deliberately  
stopped loading the file. Either there was an error in loading an  
entry (which you should see, and you can also ask the script to just  
keep going by providing the --safe option), or the file only  
contained 1003 entries.

Note that you can get progress logging by using the --logchunk  
option, which will also give you a final count of the number of  
sequences loaded.

I'm not sure how you ran your search and your download on UniProt. If
I try what you describe I get 70491 hits, and if I try to export them
using the data set manager I get the message:

This download mechanism only supports 1000 proteins. The first 1000  
proteins have been added from the selected.

Which perfectly explains what you see.

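Did you verify for yourself that the file contains 70491 entries? If
you don't have grep and wc on your Windows machine, you can use perl
one-liners directly, e.g.,

perl -n -e '/^ID / && ++$n; END {print "$n entries\n";}' <your-file-here>

In the Windows cmd.exe shell, which does not treat single quotes as
quoting characters, use double quotes around the script instead:

perl -n -e "/^ID / && ++$n; END {print qq{$n entries\n};}" <your-file-here>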
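
If you do have grep available (on a Unix machine, or through a Windows port), the same check can be sketched as a small shell function. This is only an illustration: the function name count_entries is made up here, and it assumes the file is in Swiss-Prot flat-file format, where every entry starts with exactly one line beginning with "ID ".

```shell
# Count Swiss-Prot/UniProt flat-file entries in the file given as $1.
# Assumption: each entry begins with exactly one line starting with "ID ".
count_entries() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps a count of 0 from looking
    # like a failure to the calling script.
    grep -c '^ID ' "$1" || true
}

# Usage (the file name is a placeholder for your downloaded export):
#   count_entries uniprot_export.txt
```

If the number printed is around 1000 rather than the number of search hits, the file itself is truncated and the loader is not at fault.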

	-hilmar

On Dec 4, 2006, at 5:34 PM, pelikan at cs.pitt.edu wrote:

> Hello,
>
>     My system is running bioperl 1.5.2, bioperl-db 1.5.2-005 RC, and the
> latest MySQL under Windows with ActivePerl, without Cygwin. I have 4 GB
> of memory. "make test" passes fine.
>
> The problem is that I'm not getting similar numbers of anything when I
> load datasets using load_seqdatabase.pl. For instance, if I want to load
> only proteins from Homo sapiens,
> I go to UniProt,
> use the database search function,
> do a text search for Homo Sapiens (returns 70914 hits),
> export the hits to flat file format (--format swiss) using the data set
> manager,
> and load it using load_seqdatabase.pl.
>
> The query "select count(*) from bioentry;" returns only 1003 entries.
> Moreover, it seems like the entries don't go past the B's in the alphabet -
> I can't find bioentry.descriptions like '%cytochrome%' or '%myoglobin%',
> but I can find apolipoproteins, for example.
>
> I know this is an annoying question, but if someone has more experience in
> dealing with this issue, I would be grateful for any assistance. I don't
> get any error messages, so it's difficult for me to tell what's going on.
>
> -Richard

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================
