[Bioperl-l] bioperl-db

Elia Stupka elia@fugu-sg.org
Thu, 14 Mar 2002 11:42:11 +0800 (SGT)

Hey Chris again,

> One problem is that updating an entire database takes as long as - or
> longer than - a fresh load. The loader script doesn't know if an entry
> needs updating or not, so it goes ahead and updates anyway. BioSQL is very
> entry-centric so we could take advantage of that.

I was thinking about something very simple that could help (Elia - down to
earth element in a sea of AI-wizkids :) )... why not have a flag on
load_seqdatabase for fresh loads that would then trigger zero paranoia in
the object layer? In other words if I know I am loading a sane swissprot
file for the first time in biosql, there is no need to check for any
existing data, right? Ah, uh, probably wrong for qualifiers and so on, but
at least for bioentries and features...
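
A minimal sketch of the fresh-load flag idea, in Python rather than Perl
and against a toy one-table SQLite database. The table layout, the
function names and the `fresh` parameter are assumptions made up for
illustration; this is not the BioSQL schema or the bioperl-db object layer.

```python
import sqlite3


def make_toy_db():
    """Create an in-memory database with a hypothetical one-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bioentry (accession TEXT PRIMARY KEY, seq TEXT)")
    return conn


def load_entries(conn, entries, fresh=False):
    """Store (accession, seq) pairs and return how many lookups were made.

    With fresh=True the existence check is skipped entirely, on the
    assumption that the target database holds none of these entries yet.
    """
    lookups = 0
    for accession, seq in entries:
        if not fresh:
            lookups += 1
            row = conn.execute(
                "SELECT 1 FROM bioentry WHERE accession = ?", (accession,)
            ).fetchone()
            if row is not None:
                # Entry already present: update it instead of inserting.
                conn.execute(
                    "UPDATE bioentry SET seq = ? WHERE accession = ?",
                    (seq, accession),
                )
                continue
        conn.execute(
            "INSERT INTO bioentry (accession, seq) VALUES (?, ?)",
            (accession, seq),
        )
    conn.commit()
    return lookups
```

With `fresh=True` every entry costs one INSERT and no SELECT; if the
assumption is wrong, the primary key makes the insert fail loudly rather
than silently duplicating data.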


> What if we used the Bio::Index/Fetch stuff rather than a Bio::SeqIO loop,
> cache the md5checksum of each entry either in the db or a separate file,

As for all other projects I think having the checksums in can only help,
whatever you are doing (generic statement of the week award). However in
the case of annotated entries we would need to make sure we somehow
checksum not just sequence, but features too... or were you simply
thinking of checksumming the text of the entry?

> md5 is pretty fast, we could speed it up more by only checking md5 if the
> entry byte lengths differ.

I think you meant the entry text, yes, makes really good sense...
...but I like the idea of using SeqIO... is there any way we can do it via
SeqIO despite the fact that it reads things line by line? I guess not, huh?
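
A rough sketch of the entry-text checksum with the byte-length shortcut,
in Python rather than Perl. It assumes one reading of the shortcut: a
differing byte length already proves the entry changed, so md5 is only
computed when the lengths match. The function names and the cached
(length, md5) pair are hypothetical.

```python
import hashlib


def entry_fingerprint(entry_text):
    """Return (byte length, md5 hex digest) of the raw entry text."""
    data = entry_text.encode("utf-8")
    return len(data), hashlib.md5(data).hexdigest()


def needs_update(entry_text, cached):
    """Decide whether an entry must be reloaded.

    `cached` is the (length, md5) pair stored at the previous load,
    or None if the entry has never been seen.
    """
    if cached is None:
        return True
    data = entry_text.encode("utf-8")
    cached_len, cached_md5 = cached
    if len(data) != cached_len:
        # Different length: the text changed, no need to hash it.
        return True
    # Same length: only the checksum can tell whether the text changed.
    return hashlib.md5(data).hexdigest() != cached_md5
```

Since this fingerprints the text of the whole entry, a change to a
feature or an annotation line is caught as well as a change to the
sequence.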

> Another option is to slice the input fasta up across a cluster and have
> every node bang on the database in parallel. Postgresophiles say pg scales
> better this way but the ensembl pipeline seems to say otherwise.

God, why didn't I think of this earlier, doh! We do this for most of our
scripts by now, even when we are just datamining, calculating stats, we
always splat things across the farm... I'll try to think of an elegant way
of doing this, ideally you would like it to be transparent to the user,
who would just say perl load_seqdatabase swissprot sprot.dat -lsf 1

One way of doing it would be to leave the data on the mysql host and
simply (if you have LSF) send bsub commands with chunks of data to all
other hosts.
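
A minimal sketch of the chunking step, in Python rather than Perl. It
assumes a SwissProt-style flatfile where each entry ends with a line
containing only "//", deals whole entries out round-robin, and builds
(without running) one bsub command per chunk. The per-chunk loader
invocation simply mirrors the command line above; the function names and
chunk file names are hypothetical.

```python
def split_entries(lines, n_chunks):
    """Distribute whole entries round-robin over n_chunks lists of lines."""
    chunks = [[] for _ in range(n_chunks)]
    current = []
    count = 0
    for line in lines:
        current.append(line)
        if line.rstrip("\n") == "//":
            # End of entry: hand the complete entry to the next chunk.
            chunks[count % n_chunks].extend(current)
            current = []
            count += 1
    if current:
        # Trailing lines with no terminator stay together in one chunk.
        chunks[count % n_chunks].extend(current)
    return chunks


def bsub_commands(chunk_paths, namespace="swissprot"):
    """Build one LSF submission (as an argument list) per chunk file."""
    return [
        ["bsub", "perl", "load_seqdatabase", namespace, path]
        for path in chunk_paths
    ]
```

Splitting on entry boundaries rather than on byte offsets means no node
ever sees half an entry, which matters since the loader works entry by
entry.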

> If that's too slow, what about a local "bulkload" mysql db on every farm
> node, splice the flatfiles, load up a bunch of independent dbs, when
> that's done, load the main central one with some fast "INSERT INTO
> main.tbl AS SELECT FROM nodeX.tbl" statements.

I think it's complete overkill because of the extra checks you need to do,
but if you did it that way, the fastest way, alas, would be
mysqldump > mysqlimport

> platonic production rules. The decoupling is the important part. We could
> even do the parsing / event firing in C. Parsing in C may seem a bit
> mental when you're used to perl regexp magic, but maybe being forced to
> less magic would be a good thing.

Again, I think this would be the only decent way... more coding work, but
get some C parsers and use those...

> So you'll have all that done and the bioSQL mirror ready by, ooooh, next
> tuesday yeh Elia?

Sure, no problem, 2004! :)


* http://www.fugu-sg.org/~elia *
* tel:    +65 874 1467         *
* mobile: +65 90307613         *
* fax:    +65 777 0402         *