[Bioperl-l] Re: Bioperl-DB error
Hilmar Lapp
hlapp at gnf.org
Thu Mar 4 16:10:06 EST 2004
On Thursday, March 4, 2004, at 12:16 PM, Law, Annie wrote:
>
> I have bioperl installed in two places. I think it should be okay
> since the errors that I am receiving have decreased. I'm not sure
> how to get rid of the initial install or if it matters.
Brian may be the better person to answer this, but if you just remove
the Bio/ directory where it got installed, you should have purged most
of bioperl (and bioperl-db too). There are a few more files that get
installed into perl's site library root directory, but if you're
uncomfortable picking them out, leaving them there does no real harm.
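To find out where the installed Bio/ directory actually lives, you can
ask perl for the path it resolves (a quick sketch; Bio::Root::RootI is
just a module any bioperl install will have):

    perl -MBio::Root::RootI -e 'print $INC{"Bio/Root/RootI.pm"}, "\n"'

The Bio/ directory to remove is the parent of the Bio/Root/ path this
prints.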
It'd actually be useful to have an 'uninstall' command. Does anybody
have comments or ideas on that already? Maybe there already is one and
I'm just missing it?
As for having two bioperls around with one of them installed, be
careful. I recently found myself duped by believing that in this
situation, if you set PERL5LIB to point to your uninstalled bioperl,
perl will pick up the uninstalled version instead of the installed
one. This is not true - it will pick up the installed one if it can
find one. To control where perl takes the bioperl modules from, you'll
have to use the -I option to perl.
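For example, to verify which copy a given invocation will actually
load, you can print the resolved module path (the /path/to/bioperl-live
directory is a placeholder for your uninstalled tree):

    perl -I/path/to/bioperl-live -MBio::Root::RootI \
        -e 'print $INC{"Bio/Root/RootI.pm"}, "\n"'

If that prints a path under /path/to/bioperl-live, the -I option is
winning; if it prints the site-library path, the installed copy is
still being picked up.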
Unless somebody else has a better suggestion ...
>
> Should
> I be concerned about this warning?
>
> t/cluster.......ok 155/160
> -------------------- WARNING ---------------------
> MSG: failed to store one or more child objects for an instance of class
> Bio::Cluster::UniGene (PK=1)
> ---------------------------------------------------
>
>
No, don't worry. The warning reflects the fact that one of the
associations failed to insert because it was already there. I turned
off the DBI warnings on failed statements, so you no longer see those
precede a bioperl-style warning.
>
> All tests successful.
> Files=15, Tests=930, 93 wallclock secs (31.64 cusr + 1.26 csys =
> 32.90 CPU)
>
That's the important message to watch out for.
> 2) I started with an empty database and I loaded the NCBI taxonomy
> into the database, then the GO information, then locuslink (I used
> the --safe and --lookup options and did not get any errors). Loading
> the locuslink information took half a day to a day.
Sounds like about what I get. The script can report progress every n
entries using the --logchunk option; check out the POD. It will also
report some speed statistics along the way.
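As an illustration, an invocation with progress reporting could look
like this (the chunk size of 1000 is just an example value, and the
connection options are placeholders; see the POD for the exact
syntax):

    perl load_seqdatabase.pl --dbname annotatedata --namespace LocusLink \
        --format SeqIO::locuslink --logchunk 1000 LL_tmpl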
>
> Next I loaded unigene information and it is taking three days and is
> still not finished?! (I meant to use the option --mergeobjs but
> forgot when I ran the script, but I don't think that this will make
> a difference in terms of execution time.)
Human and mouse unigene releases take a *very* long time to load; each
takes about 2-2.5 days for me. The thing that makes unigene very heavy
is that every cluster member becomes its own bioentry (as it is a
sequence - though it won't have an actual sequence associated, since
the unigene SeqIO parser currently doesn't use the file that would
contain them). So, when it finishes you will find several million
bioentries, even though there are only about 120k clusters. If you
turn on progress reporting you'll see that the speed also varies
widely depending on where in the file the script is. The reason is
that some clusters are *huge*, with tens of thousands of members. A
single such cluster will lead to tens of thousands of bioentries being
created for it alone.
When you load unigene for the first time it doesn't really matter
which combination of merging or look-up options you use, since nothing
will be found anyway. Once you update, though, without purging UniGene
from the database first, this becomes critical, because a large
cluster will then not just take a long time to insert, but also to
look up. A very useful combination of options for unigene is
--flatlookup (see the POD) together with
--mergeobjs="del-assocs-sql.pl". The latter deletes the associations
partially using SQL instead of traversing object trees and deleting
one association at a time. (It is therefore also specific to the
schema version I wrote it for, which is Oracle, but it can easily be
ported to work with the mysql and pg versions. Let me know once you
are there.)
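An update run along those lines might look roughly like this (the file
name and connection options are placeholders):

    perl load_seqdatabase.pl --dbname annotatedata --namespace UniGene \
        --flatlookup --mergeobjs="del-assocs-sql.pl" \
        --format ClusterIO::unigene Hs.data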
Also, keep in mind that sequence loading can be trivially parallelized
by firing up multiple load_seqdatabase.pl processes on different
targets simultaneously. How many parallel processes are beneficial
depends on your number of CPUs and on what your RDBMS can handle.
(E.g., I always run human and mouse unigene in parallel.) Also, you
can run Unigene and LL in parallel if you've got the necessary CPU
power on the db-server end.
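In a shell that could look like this (a sketch; Hs.data and Mm.data
are placeholders for the human and mouse unigene release files, and
connection options go in as usual):

    perl load_seqdatabase.pl --dbname annotatedata --namespace UniGene \
        --format ClusterIO::unigene Hs.data &
    perl load_seqdatabase.pl --dbname annotatedata --namespace UniGene \
        --format ClusterIO::unigene Mm.data &
    wait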
> Something strange is happening: when I use mysqlcc to refresh the
> current state of the database, the number of bioentries increases
> and then decreases.
I don't know about mysqlcc and hence can't comment ...
> I am wondering if loading of unigene information is affected by the
> warning I got from the make test in bioperl-db.
No, it's not, don't worry.
> 3) On a related note, every time, including the initial time, that I
> load the database with load_ontology.pl or load_seqdatabase.pl, I use
> the --lookup option. I want to create an annotation database, so I am
> loading first the NCBI taxonomy database, then GO, then locuslink,
> then unigene. I want to find out the annotation for some clones from
> ESTs. I think I only need to use the --mergeobjs option when I load
> the unigene information?
If the database is empty or doesn't yet contain the datasource you're
trying to load, then there is no point in using --lookup (or
--mergeobjs). In fact, it'll hurt, although really not much - what
will happen is that for every bioentry a look-up (a SELECT query) is
generated which you already know won't return anything. The lookup is
fast though - it should be at or below 0.01 seconds, but again that
may depend on the load on the database.
As for --mergeobjs, there are a couple of implementations provided in
the repository. Check those out to see how it works, e.g.
update-on-new-date.pl, update-on-new-version.pl, freshen-annot.pl, and
merge-unique-ann.pl. It's not difficult to write your own if you have
other requirements.
> I read the documentation of --mergeobjs in the load_seqdatabase.pl
> script but would appreciate an example of how this works.
>
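To sketch how --mergeobjs works: if I recall the interface correctly
(do check the POD), the value can be the name of a file that evaluates
to a closure. The closure is passed the object found by look-up in the
database, the new object parsed from the input, and the database
adaptor, and it returns the object to be submitted (or undef to skip
the entry). A minimal, hypothetical example:

    # merge-keep-existing.pl (a made-up name): skip entries that are
    # already in the database, and load only the new ones
    sub {
        my ($old, $new, $db) = @_;
        return defined($old) ? undef : $new;
    }

That particular policy would let you resume an interrupted load
without re-inserting what's already there; the scripts named above
implement more useful policies along the same lines.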
> 4) Also, after I installed Bioperl-1.4 again I decreased the number
> of failed tests, but there were still 3 test failures and I am not
> sure if I should be concerned about these.
>
> Failed Test   Stat Wstat Total Fail  Failed  List of Failed
> ---------------------------------------------------------------
> t/AlignIO.t      2   512    80    5   6.25%  76-80
> t/DB.t                      78   ??       %  ??
> t/tutorial.t   255 65280    21   11  52.38%  11-21
> 122 subtests skipped.
> Failed 3/180 test scripts, 98.33% okay. 7/8305 subtests failed,
> 99.92% okay.
>
I'll leave this one to Brian, Heikki, Jason, or whoever can provide
insight.
>
> 5) I have been loading the database with the following command. I
> would like to know if the format option should be SeqIO::locuslink
> or should it simply be locuslink?
>
Either will work. SeqIO::locuslink is more precise, but since SeqIO is
(and will remain) the default, locuslink will work too. Only for
ClusterIO formats do you have to prepend ClusterIO:: to the name of
the format (e.g. ClusterIO::unigene).
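Concretely, for your command quoted below either of

    --format locuslink
    --format SeqIO::locuslink

will do, whereas unigene needs

    --format ClusterIO::unigene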
Hth,
-hilmar
> perl /root/bioperl-db/scripts/biosql/load_seqdatabase.pl --dbuser=root
> --dbpass=ms22a --dbname=annotatedata --namespace LocusLink
> --format SeqIO::locuslink /var/lib/mysql/LL_tmpl
>
>
> thanks very much,
> Annie.
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------