[Bioperl-l] Re: Bioperl-DB error
Hilmar Lapp
hlapp at gnf.org
Thu Mar 4 16:10:06 EST 2004
On Thursday, March 4, 2004, at 12:16 PM, Law, Annie wrote:
>
> I have bioperl installed in two places. I think it should be okay
> since the errors that I am receiving have decreased. I'm not sure
> how to get rid of the initial install or if it matters.
Brian may be the better person to answer this, but if you just remove
the Bio/ directory where it got installed, you should have purged most
of bioperl (and bioperl-db too). There are a few more files that get
installed into perl's site library root directory, but if you're
uncomfortable picking them out, leaving them there does no real harm.
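To find out where the installed Bio/ directory actually lives, you can
ask perl for the path it resolves (a quick sketch; Bio::Root::RootI is
just a module any bioperl install will have):

    perl -MBio::Root::RootI -e 'print $INC{"Bio/Root/RootI.pm"}, "\n"'

The Bio/ directory to remove is the parent of the Bio/Root/ path this
prints.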
It'd actually be useful to have an 'uninstall' command. Does anybody
have comments or ideas on that already? Maybe there already is one and
I'm just missing it?
As for having two bioperls around with one of them installed, be
careful. I recently found myself duped by believing that in this
situation, if you set PERL5LIB to point to your uninstalled bioperl,
perl will pick up the uninstalled version instead of the installed
one. This is not true - it will pick up the installed one if it can
find one. To control where perl takes the bioperl modules from, you'll
have to use the -I option to perl.
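For example, to verify which copy a given invocation will actually
load, you can print the resolved module path (the /path/to/bioperl-live
directory is a placeholder for your uninstalled tree):

    perl -I/path/to/bioperl-live -MBio::Root::RootI \
        -e 'print $INC{"Bio/Root/RootI.pm"}, "\n"'

If that prints a path under /path/to/bioperl-live, the -I option is
winning; if it prints the site-library path, the installed copy is
still being picked up.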
Unless somebody else has a better suggestion ...
>
> Should
> I be concerned about this warning?
>
> t/cluster.......ok 155/160
> -------------------- WARNING ---------------------
> MSG: failed to store one or more child objects for an instance of class
> Bio::Cluster::UniGene (PK=1)
> ---------------------------------------------------
>
>
No, don't worry. The warning reflects the fact that one of the
associations failed to insert because it was already there. I turned
off the DBI warnings on failed statements, so you no longer see those
precede a bioperl-style warning.
>
> All tests successful.
> Files=15, Tests=930, 93 wallclock secs (31.64 cusr + 1.26 csys =
> 32.90 CPU)
>
That's the important message to watch out for.
> 2) I started with an empty database and I loaded the NCBI taxonomy
> into the database, then the GO information, then locuslink (I used
> the --safe and --lookup options and did not get any errors). Loading
> the locuslink information took half a day to a day.
Sounds like about what I get. The script can report progress every n
entries using the --logchunk option; check out the POD. It will also
report some speed statistics along the way.
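As an illustration, an invocation with progress reporting could look
like this (the chunk size of 1000 is just an example value, and the
connection options are placeholders; see the POD for the exact
syntax):

    perl load_seqdatabase.pl --dbname annotatedata --namespace LocusLink \
        --format SeqIO::locuslink --logchunk 1000 LL_tmpl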
>
> Next I loaded unigene information and it is taking three days and is
> still not finished?! (I meant to use the option --mergeobjs but
> forgot when I ran the script, but I don't think that this will make
> a difference in terms of execution time.)
Human and mouse unigene releases take a *very* long time to load; each
takes about 2-2.5 days for me. The thing that makes unigene very heavy
is that every cluster member becomes its own bioentry (as it is a
sequence - though it won't have an actual sequence associated, since
the unigene SeqIO parser currently doesn't use the file that would
contain them). So, when it finishes you will find several million
bioentries, even though there are only about 120k clusters. If you
turn on progress reporting you'll see that the speed also varies
widely depending on where in the file the script is. The reason is
that some clusters are *huge*, with tens of thousands of members. A
single such cluster will lead to tens of thousands of bioentries being
created for it alone.
When you load unigene for the first time it doesn't really matter
which combination of merging or look-up options you use, since nothing
will be found anyway. Once you update, though, without purging UniGene
from the database first, this becomes critical, because a large
cluster will then not just take a long time to insert, but also to
look up. A very useful combination of options for unigene is
--flatlookup (see the POD) together with
--mergeobjs="del-assocs-sql.pl". The latter deletes the associations
partially using SQL instead of traversing object trees and deleting
one association at a time. (It is therefore also specific to the
schema version I wrote it for, which is Oracle, but it can easily be
ported to work with the mysql and pg versions. Let me know once you
are there.)
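An update run along those lines might look roughly like this (the file
name and connection options are placeholders):

    perl load_seqdatabase.pl --dbname annotatedata --namespace UniGene \
        --flatlookup --mergeobjs="del-assocs-sql.pl" \
        --format ClusterIO::unigene Hs.data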
Also, keep in mind that sequence loading can be trivially parallelized
by firing up multiple load_seqdatabase.pl processes on different
targets simultaneously. How many parallel processes are beneficial
depends on your number of CPUs and on what your RDBMS can handle.
(E.g., I always run human and mouse unigene in parallel.) Also, you
can run Unigene and LL in parallel if you've got the necessary CPU
power on the db-server end.
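In a shell that could look like this (a sketch; Hs.data and Mm.data
are placeholders for the human and mouse unigene release files, and
connection options go in as usual):

    perl load_seqdatabase.pl --dbname annotatedata --namespace UniGene \
        --format ClusterIO::unigene Hs.data &
    perl load_seqdatabase.pl --dbname annotatedata --namespace UniGene \
        --format ClusterIO::unigene Mm.data &
    wait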
> Something strange is happening: when I use mysqlcc to refresh the
> current state of the database, the number of bioentries increases
> and then decreases.
I don't know about mysqlcc and hence can't comment ...
> I am wondering if loading of unigene information is affected by the
> warning I got from the make test in bioperl-db.
No, it's not, don't worry.
> 3) On a related note, every time, including the initial time, that I
> load the database with load_ontology.pl or load_seqdatabase.pl, I use
> the --lookup option. I want to create an annotation database, so I am
> loading first the NCBI taxonomy database, then GO, then locuslink,
> then unigene. I want to find out the annotation for some clones from
> ESTs. I think I only need to use the --mergeobjs option when I load
> the unigene information?
If the database is empty or doesn't yet contain the datasource you're
trying to load, then there is no point in using --lookup (or
--mergeobjs). In fact, it'll hurt, although really not much - what
will happen is that for every bioentry a look-up (a SELECT query) is
generated which you already know won't return anything. The lookup is
fast though - it should be at or below 0.01 seconds, but again that
may depend on the load on the database.
As for --mergeobjs, there are a couple of implementations provided in
the repository. Check those out to see how it works, e.g.
update-on-new-date.pl, update-on-new-version.pl, freshen-annot.pl, and
merge-unique-ann.pl. It's not difficult to write your own if you have
other requirements.
> I read the documentation of --mergeobjs in the load_seqdatabase.pl
> script but would appreciate an example of how this works.
>
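To sketch how --mergeobjs works: if I recall the interface correctly
(do check the POD), the value can be the name of a file that evaluates
to a closure. The closure is passed the object found by look-up in the
database, the new object parsed from the input, and the database
adaptor, and it returns the object to be submitted (or undef to skip
the entry). A minimal, hypothetical example:

    # merge-keep-existing.pl (a made-up name): skip entries that are
    # already in the database, and load only the new ones
    sub {
        my ($old, $new, $db) = @_;
        return defined($old) ? undef : $new;
    }

That particular policy would let you resume an interrupted load
without re-inserting what's already there; the scripts named above
implement more useful policies along the same lines.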
> 4) Also, after I installed Bioperl-1.4 again I decreased the number
> of failed tests, but there were still 3 test failures and I am not
> sure if I should be concerned about these.
>
> Failed Test   Stat Wstat Total Fail  Failed  List of Failed
> ---------------------------------------------------------------
> t/AlignIO.t      2   512    80    5   6.25%  76-80
> t/DB.t                      78   ??       %  ??
> t/tutorial.t   255 65280    21   11  52.38%  11-21
> 122 subtests skipped.
> Failed 3/180 test scripts, 98.33% okay. 7/8305 subtests failed,
> 99.92% okay.
>
I'll leave this one to Brian, Heikki, Jason, or whoever can provide
insight.
>
> 5) I have been loading the database with the following command. I
> would like to know if the format option should be SeqIO::locuslink
> or should it simply be locuslink?
>
Either will work. SeqIO::locuslink is more precise, but since SeqIO is
(and will remain) the default, locuslink will work too. Only for
ClusterIO formats do you have to prepend ClusterIO:: to the name of
the format (e.g. ClusterIO::unigene).
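Concretely, for your command quoted below either of

    --format locuslink
    --format SeqIO::locuslink

will do, whereas unigene needs

    --format ClusterIO::unigene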
Hth,
-hilmar
> perl /root/bioperl-db/scripts/biosql/load_seqdatabase.pl --dbuser=root
> --dbpass=ms22a --dbname=annotatedata --namespace LocusLink
> --format SeqIO::locuslink /var/lib/mysql/LL_tmpl
>
>
> thanks very much,
> Annie.
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------